Preface



This volume contains the papers selected for presentation at the Second
International Conference on Rough Sets and Current Trends in Computing
(RSCTC 2000), held in the beautiful Rocky Mountains resort town of Banff,
Canada, October 16-19, 2000.

The main theme of the conference is the theory of rough sets, its applications,
and its theoretical development. The program also includes numerous research
papers from areas related to rough sets, such as fuzzy sets, data mining,
machine learning, pattern recognition, uncertain reasoning, neural nets, and
genetic algorithms, as well as some selected papers from other areas of
computer science. This composition of research areas reflects the general
philosophy of bringing together researchers representing different, but often
closely related, research paradigms to enhance mutual understanding and to
stimulate the exchange of ideas and of sometimes very diverse points of view
on similar problems.

The conference, the second in the series, stems from annual workshops de- 
voted to the topic of rough sets initiated in 1992 in Kiekrz, Poland (the second 
workshop was also held in Banff in 1993). The first conference was held in 1998 
in Warszawa, Poland, followed by the most recent workshop organized in Yama- 
guchi, Japan in 1999. 

It has been over twenty years now since the first introduction of the basic
ideas and definitions of rough set theory by Dr. Zdzislaw Pawlak. As with many
other of Dr. Pawlak's ideas, the theory of rough sets now belongs to the
standard vocabulary of computer science research, in particular research
related to uncertain reasoning, data mining, machine learning, and pattern
recognition, to mention just a few.

In this context, one could ask the question as to what makes this theory so 
attractive in all these other areas which have already developed methodologies of 
their own. It seems that the universality of the theory is the keyword. It touches 
the very essence of the set definition, one of the fundamental notions of modern 
mathematics. The standard set theory is closely related to Boolean logic which 
in turn is at the heart of the operation of digital computers. 

It is well known that many practical problems cannot be solved satisfactorily 
by programming existing computers, in particular problems related to learning, 
pattern recognition, some forms of control etc. The difficulty stems from the fact 
that it is often impossible to create black-and-white algorithmic descriptions of 
the objects of interest occurring in different application areas, for example
waveform patterns occurring in sound analysis. The theory of rough sets and its
extensions provide rigorous mathematical techniques for creating approximate 
descriptions of such objects, for analyzing, optimizing, and recognizing the limits 
of what can be effectively distinguished (i.e. classified) by means of the available 
object representation. 

This is not to say that all these difficult complex object classification-related
problems would be automatically solved with the adoption of the rough set 
approach. Instead, the rough set approach provides a common philosophical 
framework supported by precise mathematical language for dealing with these 
problems. 

However, the details of specific solutions must be supplied by experts working 
in particular subject areas. Past experience indicates that the rough set approach 
is a team-oriented methodology. Usually a single individual does not have the 
expertise required for the effective application of the rough set approach to 
a practical problem. This means that developing practical applications of this 
methodology is difficult and costly. From the perspective of conference organizers,
it has also led to the relative rarity of application-oriented publications, and we
would like to see more of these.

This imbalance was visible in previous workshops and conferences on the sub- 
ject and is repeated here. We have a fine collection of good theoretical papers but 
application papers are few. Consequently, the proceedings are organized along 
the subject lines without further separation into theoretical versus practical
papers. We sincerely hope that the current strong theoretical growth of rough set
approaches, as demonstrated in this volume, will eventually lead to parallel
growth on the application side, resulting in stronger participation of industrial
users of the methodology.

The RSCTC 2000 program was further enhanced by invited keynote spea- 
kers: Setsuo Ohsuga, Zdzislaw Pawlak, and Lotfi A. Zadeh, and invited plenary 
speakers: Jerzy Grzymala-Busse, Roman Swiniarski, and Jan Zytkow. 

The success of RSCTC 2000 was a result of the joint efforts of the authors, the
Advisory Board, the Program Committee, and the referees. We want to thank the
authors for deciding to publish their research at this conference and for their
patience during the delays which occurred when processing the submissions. The
preparation of this volume would not have been possible without the help of the
referees and the members of the Advisory Board, to whom we would like to express
our thanks and appreciation for the time and effort they put into the refereeing
and paper selection process. In particular, we would like to thank the Program
Committee Chairs, Andrzej Skowron (Europe) and Shusaku Tsumoto (Asia), for their
kind support and advice on conference-related issues. We are grateful to our
sponsors: the Faculty of Science, the President's Office, and the Computer
Science Department, University of Regina, for their financial and organizational
support. We would also like to express our thanks to Ms. Rita Racette for her
secretarial and organizational help before and during the conference. Much of
the research on rough sets and other topics presented in this volume was
supported by research grants from the Natural Sciences and Engineering Research
Council of Canada. This support is gratefully acknowledged.

Rough Sets: Trends, Challenges, and Prospects



Wojciech Ziarko 



Computer Science Department 
University of Regina 
Regina, Saskatchewan, S4S 0A2 
Canada 



Abstract. The article presents a brief review of the past and current
state of rough set-related research and offers some ideas about the
prospects of rough set methodology in the context of its likely impact
on future computing devices. The opinions presented are solely those of
the author and do not necessarily reflect the point of view of the majority
of the rough set community.



The fundamentals of rough set theory were laid out by Zdzislaw Pawlak
about twenty years ago [1-3]. His original definition of the rough, or
approximately described, set in terms of some already known and well-defined
disjoint classes of indistinguishable objects seemed to capture in mathematical
terms the essence of the limits of machine perception and empirical learning
processes. The mathematical theory of rough sets which was developed around this
definition, with contributions from many mathematicians and logicians (see, for
example, [5,6,12-18,33-35]), resembles classical set theory in its clarity and
algebraic completeness. The introduction of this theory in the 1980s coincided
with the surge of interest in artificial intelligence (AI), machine learning,
pattern recognition, and expert systems. However, much of the research in those
areas at that time lacked comprehensive theoretical foundations, typically
involving the design of either classical logic-based or intuitive algorithms to
deal with practical problems related to machine reasoning, perception, or
learning. The logic-based approaches turned out to be too strict to be
practical, whereas intuitive algorithms for machine learning, pattern
recognition, etc. lacked a unifying theoretical basis for understanding their
limitations and generally demonstrated inadequate performance.
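
As an illustration of the definition described above, the following minimal
sketch computes the lower and upper approximations of a target set from the
classes of indistinguishable objects. The function names, the attribute names,
and the toy data are assumptions made for this example only; they are not taken
from the cited works.

```python
from collections import defaultdict


def indiscernibility_classes(objects, attributes):
    """Group object ids whose values agree on every listed attribute."""
    classes = defaultdict(set)
    for obj_id, row in objects.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(obj_id)
    return list(classes.values())


def approximations(classes, target):
    """Return the (lower, upper) approximations of the set `target`."""
    lower, upper = set(), set()
    for cls in classes:
        if cls <= target:   # class entirely inside the target set
            lower |= cls
        if cls & target:    # class overlaps the target set
            upper |= cls
    return lower, upper


# Hypothetical toy table: objects 1 and 2 cannot be told apart.
patients = {
    1: {"fever": "yes", "cough": "yes"},
    2: {"fever": "yes", "cough": "yes"},
    3: {"fever": "no", "cough": "yes"},
    4: {"fever": "no", "cough": "no"},
}
flu = {1, 3}

classes = indiscernibility_classes(patients, ["fever", "cough"])
lower, upper = approximations(classes, flu)
boundary = upper - lower
```

In this toy case object 3 certainly belongs to the target set, whereas objects
1 and 2 form the boundary region: the available attributes cannot distinguish
them, although only one of them is in the set.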

To many researchers the theory of rough sets appeared as the missing com- 
mon framework to conduct theoretical research on many seemingly diverse AI- 
related problems and to develop application-oriented algorithms incorporating 
the basic ideas of the rough set theory. Numerous algorithms and systems based 
on the rough set theory were developed during that early period, most of them 
for machine learning and data analysis tasks [4-7,13-16]. As the software for 
developing rough set-based applications was becoming more accessible, the ex- 
perience with applying rough set methods to various applications was being 
gradually accumulated. The experience was also revealing the limitations of the 
rough set approach and inspiring new extensions of the rough set theory aimed 
at overcoming these limitations (see, for instance [8,10-12,16,17,23,24,26]). In
particular, the definitions of rough set approximation regions [2-3], originally 
based on the relation of set inclusion, turned out to be too restrictive in most 
applications that use real-life data for the purpose of deriving empirical
models of complex systems. Consequently, typical extensions of the rough
set model were aimed at incorporating probabilistic or fuzzy set-related aspects
into the rough set formalism to soften the definitions of the rough approximation
regions so that a broader class of practical problems could be handled
[8,10,12,16,17,25,27].
This was particularly important in the context of data mining applications of 
rough set methodology where probabilistic data pattern discovery is common. 
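
One way such a softening can be sketched is to replace strict set inclusion
with a threshold on the fraction of a class that falls inside the target set.
The function name, the `threshold` parameter, and the data below are
illustrative assumptions and do not reproduce the definitions of any specific
cited paper.

```python
def soft_regions(classes, target, threshold=0.75):
    """Split the universe into positive, boundary and negative regions.

    A class is positive if at least `threshold` of its members are in
    `target`, negative if at most 1 - threshold are, and boundary otherwise.
    With threshold = 1.0 this reduces to the inclusion-based regions.
    """
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0.5, 1]")
    positive, boundary, negative = set(), set(), set()
    for cls in classes:
        share = len(cls & target) / len(cls)
        if share >= threshold:
            positive |= cls
        elif share <= 1.0 - threshold:
            negative |= cls
        else:
            boundary |= cls
    return positive, boundary, negative


# Hypothetical classes of indistinguishable objects and a target set.
classes = [{1, 2, 3, 4, 5}, {6, 7}, {8, 9, 10}]
target = {1, 2, 3, 4, 6}

soft = soft_regions(classes, target, threshold=0.75)
strict = soft_regions(classes, target, threshold=1.0)
```

With the strict setting the first class lands in the boundary region because
one of its five members is outside the target set; with the relaxed threshold
it is accepted into the positive region.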

Another trend which emerged in recent years was focused on generalizing the 
notion of the equivalence relation defining the approximation space which forms 
the basis of the rough approximations [20,22]. 

When constructing models of data relationships using the methodology of 
rough sets it is often necessary to transform the original data into a derived 
form in which the original data items are replaced with newly defined secondary 
attributes. The secondary attributes are functions of the original data items. In 
typical applications they are derived via a discretization process in which the 
domain of a numeric variable is divided into a number of value ranges. The
general question of how to define the secondary attributes, or of what the best
discretization of continuous variables is, has not been answered satisfactorily so
far, despite valuable research contributions dealing with this problem [21,25].
It appears that the best discretization or secondary attribute definitions are 
provided by domain experts based on some prior experience. 
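
The discretization step mentioned above can be sketched as follows. Equal-width
ranges are used here only as a simple placeholder assumption; cut points
supplied by a domain expert could be passed to `discretize` in exactly the same
way. All names and values are hypothetical.

```python
from bisect import bisect_right


def equal_width_cuts(values, n_bins):
    """Return the n_bins - 1 interior cut points of equal-width ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def discretize(value, cuts):
    """Map a numeric value to the index of the value range it falls in."""
    return bisect_right(cuts, value)


# Hypothetical numeric attribute (e.g. a temperature reading).
readings = [36.0, 36.5, 37.0, 38.2, 40.0]
cuts = equal_width_cuts(readings, n_bins=4)
codes = [discretize(v, cuts) for v in readings]
```

Each reading is replaced by a range index, which then serves as the secondary
attribute in the derived data table.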

Another tough problem awaiting a definitive solution is the treatment of
data records with missing or unknown data item values. Although a number of
strategies have been developed in the past for handling missing values, some
of them involving replacement of the missing value by a calculated value, the
general problem is still open, as none of the proposed solutions seems to address
the problem of missing values satisfactorily.
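
A replacement strategy of the kind mentioned above might look like the sketch
below, which assumes that missing entries are marked with `None` and that the
calculated value is simply the most frequent observed value of the attribute.
This is one illustrative choice among many, and the names are hypothetical.

```python
from collections import Counter


def fill_missing_with_mode(records, attribute, missing=None):
    """Return copies of `records` with missing values of `attribute`
    replaced by the most frequent observed value."""
    observed = [r[attribute] for r in records if r[attribute] is not missing]
    if not observed:
        # Nothing to calculate a replacement from; leave records unchanged.
        return [dict(r) for r in records]
    mode = Counter(observed).most_common(1)[0][0]
    return [
        {**r, attribute: mode if r[attribute] is missing else r[attribute]}
        for r in records
    ]


# Hypothetical records; the second one has an unknown value.
records = [
    {"color": "red"},
    {"color": None},
    {"color": "red"},
    {"color": "blue"},
]
filled = fill_missing_with_mode(records, "color")
colors = [r["color"] for r in filled]
```

Such a substitution makes the table complete but may introduce artificial
regularities, which is one reason the general problem remains open.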

The main challenge facing the rough set community today is the development
of real-life industrial applications of the rough set methodology. As can be
seen in this volume, the great majority of papers are theoretical, with only a
few papers reporting practical implementations. Experimental applications of
rough set techniques have been attempted since the first days of rough set
theory. They have covered the whole spectrum of application domains ever
attempted by AI-related methods. Before the advent of the discipline of data
mining, the methods of rough sets were already in use, for example, for the
analysis of medical or drug data and chemical process operation log data, with
the objective of better understanding the interactions between data items and
developing decision support models for medical diagnosis, drug design, or
process control (see, for example, [5,9,18,27-32,36]). They were also used for
experiments with speech and image classification, character recognition, musical
sound classification, stock market prediction, and many other similar
applications. However, applications which ended up in industrial use and
brought concrete financial benefits are very rare.
This is probably due to the fact that the development of rough set applications
requires teamwork and generally substantial funding, access to real-life data,
and commitment from all involved parties, including domain experts and rough
set analysis specialists. It also requires quality software specifically tailored
for the rough set-based processing of large amounts of data. Satisfying all these
requirements is not easy, and is next to impossible for a single researcher with
a very limited budget, which results in the current lack of industrial
applications of the rough set approach. It appears that until rough set
methodology acquires major industrial sponsors committed to implementing it in
their products or processes, we are going to observe the continuation of the
current rough set research situation, which is dominated by theoretical works
with relatively few experimental application projects.

There are other challenges for the rough set methodology on the road to
mainstream acceptance and popularity. One of the important issues here appears
to be the unavailability of popular books written in a way that makes them
accessible to average computer scientists, programmers, or even advanced
computer users. Currently, the rough set literature is aimed solely at computer
science researchers with advanced backgrounds and degrees. Clearly, this
seriously limits the scope of potential users of this methodology. A related
aspect is the lack of modern, easy-to-use popular software tools which would
help novice and sophisticated users familiarize themselves with rough sets and
develop simple applications. Tools of this kind should be extremely simple to
use, with a graphical user interface, and should not require the user to do any
programming in order to create an application.

To have users and developers employing rough set tools, the subject of rough 
sets should be taught at universities. Although it happens occasionally, it is far 
from common practice and computer science graduates normally know nothing 
about rough sets. Proper education is also important to ensure sustained
growth of the discipline. This requires the creation of educational programs
incorporating rough set methodology, and the development of textbooks and
convincing educational examples of rough set applications.

The establishment of a permanent publicity mechanism for rough set research
is the next major issue awaiting a solution. At the moment the publicity
tools for propagating rough set methodology, results, and applications are very
limited. There is no specialized journal related to rough sets, and there is a
lack of a centralized, well-funded unit coordinating activities in this area.
All publicity and organizational work in the rough set community is done by
volunteers who devote their time and energy at the expense of other activities.
It seems that some of the organizational tasks should be done by paid
professionals to ensure continuity and quality. This, however, requires
financial resources which are currently not available.

As far as the future of rough set methodology is concerned, I personally 
believe that rough sets are going to shine in the following application areas. 

— Trainable control systems

Trainable control systems have their control algorithms derived from operation
data recorded during training sessions in which the system, or a system
simulator, is controlled by a skilled human operator or operators. It is well
known that many complex control problems cannot be programmed due to the lack of
system behavioral models, complex nonlinearities, etc. Problems of this kind
appear in robotics, process control, animation, etc. It appears, based on prior
research and experience with laboratory experiments [9,27,28,31,36], that rough
set theory provides a practical solution to the problem of deriving compact
working control algorithms from complex system operation log data.

— Development of predictive models from data to increase the likeli- 
hood of correct prediction (data-based decision support systems) 
With the current flood of data accumulated in databases and representing
various aspects of human activity, future decision support systems will
increasingly rely on factual data in deriving their decision algorithms.
This trend has been recognized by data mining researchers, many of whom are
adopting the rough set approach. The role of rough sets in this area is expected
to grow with the introduction of easy-to-use, industrial-strength software
tools.

— Pattern classification (recognition) 

The technology of complex pattern recognition, including sound and image 
classification is not mature yet despite some products already on the mar- 
ket. It appears that rough set theory will have an important role to play 
here by significantly contributing to algorithms aimed at deriving pattern 
classification methods from training data. 

In general, rough set theory provides the mathematical fundamentals of a
new kind of computing, data-based rather than based on human-supplied algorithms.
This data-based form of computing will become increasingly important in
the twenty-first century as the limits of what can be programmed with standard
algorithmic methods become more apparent. Some sort of saturation will
be observed in the class of problems solvable with the traditional techniques,
in parallel with the unsaturated growth in the power of processors and
ever-growing human expectations. This will force the necessary shift to
non-traditional system development methods in which rough set theory will play
a prominent role, similar to the role of Boolean algebra in contemporary
computing devices.

Abstract. The possibility of extending the scope of computers to aid human
activity is discussed. Weak points of humans are discussed, and a new
information technology that can back up human activity is proposed. It must
be an intelligent system in order to enable a computer-led interactive system. A
conceptual architecture of the system and the various technologies that make up
the architecture are then discussed. The main issues are: a modeling scheme to
accept and represent a wide range of problems, a method for externalizing human
ideas and representing them as a model, a large knowledge base and the
generation of specific problem-solving systems, autonomous problem decomposition
and solving, program generation, integration of different information processing
methods, and knowledge acquisition and discovery.



1. Introduction 

Today, various human activities involve computers. The computer’s power is
still growing, and computers are taking on a larger share of human activity,
driving each activity to grow larger and more complex. As a result, social
activity as a whole is expanding considerably. However, do computers aid all
aspects of the activities in which human beings need help? The answer is “no”.
In some fields, such as the recent applications around the Internet, computers
cope with many tasks very well and are accelerating the emergence of many new
activities. But in some other fields, e.g. in engineering design, computer
technology remains at a rather low level compared to what is needed. This
inequality in computer support among human activities is growing. As social
activity expands, the need for activities such as design and development tasks
also grows, without adequate methods to support them. Nevertheless, expanding
social activity requires people to design and develop many new systems. There
is concern that this might cause many troubles in human society, because the
requirements might go beyond human capability. For example, it was pointed
out in [1] that a new type of problem was arising in software development because
of the increased size of software systems. In other fields, too, unexpected
troubles are increasing, such as accidents in nuclear plants, failures of rocket
vehicles, etc.

There have been many discussions of the causes of these accidents. Apart from
those considered the primary causes, the inability of people to keep up with the
increasing scale and complexity of tasks lies at the root of the troubles in the
conventional human-centered style of development. In other words, human beings
create the troubles but can no longer solve them by themselves. Then is there
something that can do it in place of humans? The only possibility for answering
this question is to make computers more intelligent and to change the
development style by introducing them into system development. Such computers
must be able to back up the weak points of human beings and reduce the human
load, for human capability has already come close to its limit, whereas the
computer’s capability still has room to be enhanced by the development of new
information technology. This does not mean that computers can be more
intelligent than human beings but that, as will be discussed below, computers
help humans make the best decisions by taking over the tasks of managing the
decision-making environment. For this purpose it is necessary to analyze the
human way of problem solving, to know the weak points of humans, to find a
possible alternative method, and then to develop new information technology that
can back up and aid human activity. This requires AI technology, but with wider
applicability than anything that has been developed so far [2].

In this paper human activities are classified into two classes: those for which
conventional information technology works effectively, and the others. These are
called the first type and the second type of activities, respectively. The
required information technologies differ according to the type of activity.
Roughly speaking, the two classes correspond to whether or not an activity can
be represented in an algorithmic way.

The effectiveness of using computers has been demonstrated in many applications
in which the computer’s processing mechanism matches well what is required for
problem solving. These are of the first type. Typical examples are business
applications, computations of large mathematical functions in scientific and
engineering applications, large-scale simulations, and so on. Recently a new
type of application has been increasing in which computers deal directly with
signals generated in the surrounding systems without human intervention.
Typical examples are information processing in network applications and the
computer control systems embedded in various engineering systems such as vehicle
engines. These are, however, all first type applications, because each of them
can be represented by an algorithmic method and, once the activity is specified,
the method of developing an information system to achieve the goal is decided
without difficulty. These activities have been computerized. Along with the
increase in computing power, the number and the scale of these applications are
rapidly increasing. This tendency drives the growth of human activities in all
aspects.

But from the viewpoint of information processing technology, this class of
applications covers only a part of the whole area of human activities, and there
are various problems that are not included in this class. In fact, developing a
software system for an application such as those above is itself an activity
that does not belong to the first class. The computerization of such activities
has been difficult, and they are mostly left to human beings. The objective of
this paper is to discuss the characteristics of human activities and the
information technology that can cover both types of activities, especially the
second type.

2. Limitation of Human Capability and Expected Difficulties

Why have human beings been able to deal successfully with most activities so
far but have troubles today? Why can current computers not support these
activities well? It is necessary to answer these questions.



2.1 Limitation of Human Capability

That human capability is limited is one of the reasons why new troubles occur. 

Human capability is limited in many aspects as follows. 

(1) The limitation to manage very large scale objects (scale problem); A person 
cannot include too many objects in the scope of consideration at a time. If an 
object contains many components that exceed this limit, then some components 
may be ignored and the object cannot be well managed. 

(2) The limitation to manage complex objects (complexity problem); As the mutual
relations between objects get large and dense, a change at an object propagates
widely. A person cannot foresee its effect if the scope of the relation expands
beyond a certain limit.

(3) The limitation to follow up rapidly changing situations (time problem); There
are two kinds of speed problems: the limit of human physical activity and
the limit of human mental activity. The first has long been a research
subject in science and technology and is omitted here. The
latter is the problem of completing mental work in a short time. Developing a
large system in a short time is a typical example.

(4) The limitation of self-management of making errors (error problem); Human
beings make errors in whatever they do. Sometimes this leads to
serious accidents. So far a number of methods to reduce the effects of human
error have been developed. This situation may not change and the effort will
continue in the same way as before.

(5) The limitation to understand many domains (multi-disciplinary problem); When
an object grows large and complex, it has various aspects concerning different
disciplines. A person who deals with such an object is required to have
multi-disciplinary knowledge. Acquiring multi-disciplinary knowledge, however, is
difficult for many individuals. This is not only because of the limitation of
each individual’s ability but also because the social education system is not
suited to students pursuing multi-disciplinary studies. It means that some
aspects might not be considered well during problem solving, and very often a
large and complex object has been and will be designed and manipulated without
a well-balanced view of the object. This multi-disciplinary problem is
considered to be one of the most serious problems in the near future.

(6) The limitation to understand each other through communication
(communication problem); A large problem, for example a large object design, is
decomposed into a set of small problems and distributed to many persons. In
order to keep the solution of the whole problem consistent, the relations
between the separated sub-problems must be carefully dealt with. This requires
close communication between the persons responsible for the sub-problems. In
practice it is sometimes incomplete. In particular, when there are time lags
between the solving of the sub-problems, keeping good communication is even more
difficult. Very often serious problems occur when an existing system is to be
modified by a person different from those who originally designed the system.
This is because information is transmitted incorrectly between them, which in
turn is because the recording of the decision-making in problem solving is
insufficient.

(7) The limitation to see one’s own activity objectively (objectivity problem); It
is difficult for many people to see their own activities objectively and record
them. Therefore their activities cannot be traced faithfully afterward. This
results in the loss of responsibility.



2.2 Characteristics of New Problems

On the other hand, problems arising recently have the following characteristics. 

(1) The scale of objects is growing and the number of components is getting larger.
Many recently developed artifacts, for example airplanes, buildings, airports,
space vehicles, and so on, are becoming bigger. Not only artifacts but also
many social systems, such as government organizations, enterprises, hospitals,
education systems, security systems, and so on, are growing larger. Whatever
these objects may be, those who manage them are persons.

(2) Not only is the number of components increasing, but the mutual relations
among the components of an object are also becoming closer. Accordingly the
object is becoming complex. For example, electronic systems are growing more and
more complex. In social systems too, e.g. in a hospital, the different sections
used to be responsible for different aspects of disease and were relatively
separate. But today they are becoming closely related to each other.

(3) The time requirement is becoming more serious. Many social systems are
changing dynamically. For example, the restructuring of enterprises, including
M&A (mergers and acquisitions), is progressing in many areas. Since every social
system needs an information system, the development of a new information
system in a short time becomes necessary. This is a new and truly difficult
problem for humans.

(4) The extent of the expertise needed in an activity is expanding rapidly across
the borders between different domains. In order to achieve the given goal of the
activity, the subject of the activity is required to keep all related expertise
in view. This requires the subject to have a multi-disciplinary view.



2.3 Queries to Answer in Order to Avoid Expected Unfavorable Results

Humans cannot resolve the problems expected to arise. Consequently various
unfavorable results are expected to occur. In order to avoid them a new method
has to be developed. The method must be able to answer the following queries.

(1) A large problem has to be decomposed into smaller sub-problems, and these
sub-problems have to be distributed to separate subjects of problem solving
(persons or computers) so that the role of each subject is confined to an
appropriate size (scale problem, complexity problem).

Question: How to decompose a problem? 

(2) The decomposed sub-problems are not always independent but are usually related
to each other. The sub-problem solvers must then keep communicating with each
other. Sometimes the communication is incomplete (communication problem).

Question: How to assure proper communication between the different problem solvers?

(3) The required speed of system development is increasing. Human
capability may not be able to keep up with this speed (time problem). Automatic
system development, for example automatic programming, becomes
necessary.

Question: How to automate system development? 

(4) A problem may concern various domains. Multi-disciplinary knowledge is
required of the subjects of problem solving (multi-disciplinary problem).

Question: How to assure the use of knowledge in the wide areas? 

(5) Each person may make decisions in his/her problem solving. These decisions may
not be proper, and they affect the final solution. The more persons are involved
in solving a problem, the greater the possibility that the final solution is
improper. The chance of an improper solution therefore grows when a large
problem is decomposed and distributed to different people, and the solutions of
the sub-problems are integrated into the total solution. Roughly, this
probability is n times larger than that of solving a single problem, provided
the problem is decomposed into n sub-problems. Therefore the decisions made in
the course of solving each sub-problem must be recorded so that they can be
checked afterward. But very often this is not achieved properly (objectivity
problem).

Question: How to record the decisions made by persons correctly? 
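
The "n times" estimate in item (5) can be made explicit under a simplifying assumption that the text itself does not state: each of the n sub-problems is solved improperly with the same small probability p, independently of the others. Under that assumption the probability that at least one sub-solution is improper is

```latex
P_{\text{improper}}(n) \;=\; 1 - (1 - p)^{n} \;\approx\; n\,p
\qquad \text{for } n p \ll 1 .
```

For instance, with p = 0.01 and n = 10 the exact value is 1 - 0.99^{10} ≈ 0.096, close to the rough estimate np = 0.1. The symbols p and the independence assumption are introduced here only for illustration.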



3. An Approach toward Enlarging the Capability of Information 
System 

Since many troubles occur because of the limits of human capability, the new
method must reduce the human tasks as far as possible. If persons cannot do these
tasks themselves, what can do them in their place? The only possibility is to make
computers more intelligent so that they back up the weak points of humans. A new
information technology based on a new idea is necessary for this purpose. This does
not mean that computers can be more intelligent than human beings. Human beings
have high-level intelligence and are capable of making decisions of the best quality.
They can do so if they are in an environment suited to them with respect to the scale
of the target object, the time requirement, the scope of required knowledge, and so
on. But very often they are forced to make decisions in an unsuited environment. The
way to resolve this problem is to make computers more intelligent so that they back
up human activity: first by letting people know of their failures and, more
positively, by reducing the rate of wrong decisions through providing the best
environment for human decision making.



3.1 Computer-Led Interactive Systems

The minimum requirement for such computers is to record the history of decisions
made by human beings. One of the difficulties in solving large and complex
problems is that a problem is decomposed and shared among many people. The
decisions made by them then become invisible afterward. If some decision is
improper, the total system does not work properly, but no one can check it. A system
that satisfies the above requirement improves this situation. It leads us to change
the style of human-computer interaction from the conventional human-led interactive
systems to computer-led interactive systems, where an X-led interactive system means
that X has the initiative in the interaction and is able to manage the
problem-solving process. The computer can then record the individual decisions made
by human beings in this process. Such a system resolves the objectivity problem by
recording the problem-solving history in place of the human problem solver.

Generally speaking, problem solving is composed of a number of stages such as
problem generation, problem representation, problem understanding, solution
planning, searching for and deciding on a solution method, execution of the method,
and displaying the solution. Currently computers take part in only a portion of this
process. Persons have to do most of it, i.e. everything from the beginning until a
solution method is decided and, in addition, the programming needed afterward in
order to use computers. Many decisions are made in this process. To record these
decisions together with the background information on which they are based, the
process has to be managed by computers. Automating problem solving to a large extent
is necessary for this purpose. This does not mean letting computers do everything
but letting them manage the problem-solving process.

Hence, on the one hand, autonomy is of primary importance for achieving the goal of
computer-led interactive systems. On the other hand, if autonomy can be achieved to
a large extent, the system can not only record decision making but also actively
provide a proper environment for human decision making. For example, the system
can decompose a large-scale problem into smaller-scale problems suited to many
people, and it can also prepare and use multi-disciplinary knowledge in various
domains.

It is possible to develop an information technology that aids human activity in this
manner. The basic requirements for the system design are derived from an effort to
answer the queries given in 2.3.



3.2 Autonomous System 

Autonomy is defined in [3], in relation to agency, as operation without direct
human intervention. It is also said there that an autonomous system has some kind of
control over its actions and internal state. But this definition is too broad for
those who are actually going to realize autonomy. Since every problem-solving method
differs according to the problem to be solved, how is the strategy for controlling
the operations to be determined for each problem?

Therefore, it is necessary to make the definition more concrete. Instead of this
broad definition, autonomy is defined in this paper as the capability of a computer
to represent and process the problem-solving structure required by the problem. The
problem-solving structure means the structure of the operations for arriving at the
solution of the given problem. It differs for each specific problem, but problems
of the same type have the same skeletal problem-solving structure. For example,
design as non-deterministic problem solving is represented as the repetition of
model analysis and model modification (Figure 1). Its structure is not
algorithm-based as in the ordinary computerized methods. The details of the
operations for design differ according to the domain of the problem and the extent to
which the problem area is matured. But the difference in detail can be absorbed by
the use of a domain-specific knowledge base, provided that the method of using the
domain knowledge base for problem solving is made the same even for different
problems. A unified problem-solving structure is then obtained for this type of
problem. Conversely, a problem type is defined as a set of problems with the same
problem-solving structure. The necessary condition for a computer to be autonomous
for wide classes of problems is therefore that the computer is provided with a
mechanism that dynamically generates the proper problem-solving structures for the
different types of problems.

When a problem is large, problem decomposition is necessary before going into
the detailed problem-solving procedure for finding a solution. Problem
decomposition is problem dependent to a large extent, but the decomposition method
can be made common to many types of problems. System autonomy implies decomposing
problems autonomously. The system then resolves the scale/complexity problem.



4. System Architecture for Assuring System Autonomy 

The autonomous system must be able, first of all, to represent problem-solving
structures for different problem types in response to the user’s request and to deal
with them autonomously. A special organization is required for such a system. A
conceptual architecture of the system and its various component subsystems are
discussed below.



4.1 Conceptual Architecture of Autonomous System 

From the viewpoint of information processing technology, there are two classes of
problems: those that can be solved by deterministic problem-solving methods and
the others. For problems of the deterministic class, the stage of providing a
problem-solving method and that of executing it to obtain a final solution can be
separated as independent operations. For non-deterministic problem solving, on
the contrary, these two operations cannot be separated but are closely related to
each other. This requires a trial-and-error approach. In current computers, making
a program and executing it are separated as different operations. This is suited
only to deterministic problem solving and is an intrinsic nature of computers based
on procedural languages. Such a computer cannot be an autonomous system for
non-deterministic problems. A different architecture that enables trial-and-error
operations is necessary for realizing the autonomous system.

Fig. 1. Problem solving structure



Trial-and-error operations can be realized provided that modularized representations
of operations are available in computers, because different combinations of
operations can then be generated to correspond to different trials. To realize this
idea, a completely declarative language and its processor must be provided on top of
the conventional CPU with its procedural language. Figure 2 illustrates a conceptual
architecture for achieving this idea. The CPU is the processor of the procedural
language. An inference engine, as the processor of the declarative language, is
implemented as a special procedural program.
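
The idea that an inference engine is an ordinary procedural program that processes declarative rules can be illustrated with a minimal sketch. It is not the author's engine: the forward-chaining strategy, the propositional rule format and all fact names are assumptions made only for this example.

```python
# Hypothetical illustration: declarative rules held as data and processed by
# a small procedural interpreter (a forward-chaining inference engine).

def forward_chain(facts, rules):
    """Apply (premises, conclusion) rules until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

# Assumed rule base loosely following the analyze/modify cycle of design.
rules = [
    (frozenset({"model_analyzed", "requirement_unmet"}), "modify_model"),
    (frozenset({"modify_model"}), "analyze_model_again"),
]
result = forward_chain({"model_analyzed", "requirement_unmet"}, rules)
```

Because the rules are data rather than program text, a different combination of rules can be tried without changing the interpreter, which is the property the trial-and-error argument relies on.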

Thus the activities in a computer are classified into two classes in relation to
the function required of the computer: the activities described in procedural form
and those described only in declarative form. The classification principle is as
follows. The activities that can be fixed in detail in advance, such as the
operations around the CPU including the OS, are represented in procedural form,
while the activities that must adapt to a changing environment, mainly those for
applications, must be represented in declarative form. There are some cases that are
difficult to decide with this principle. Every activity requires proper control. To
represent the problem-solving structure and the control of operations in declarative
form requires meta-rules. In many cases a meta-rule can cover a wide scope of
operations. That is, a single high-level control expression can control a scope of
operations, and a high-level operation can be fixed. In this paper, however, the
activities that are defined in relation to applications are represented in
declarative form. This assures the flexibility to change the control rules. To
achieve this goal, the declarative language is designed to link easily with
procedural operations.




Fig. 2. Conceptual architecture of new information systems 



4.2 Components of Autonomous System 

The key issue is to clarify the requirements for the knowledge processing subsystem
in Figure 2 and to develop a new technology that satisfies them. The basic issues
constituting this technology are as follows (Figure 3).

1. New modeling method

2. Aiding human externalization

3. Large knowledge base and problem solving system generation

4. Autonomous problem decomposition and solving

5. Generation of program

6. Integration of different IPUs of different styles

7. Knowledge gathering, acquisition and discovery

Though these form the kernel of the system, only the outlines of the approaches that
the author’s group is taking are introduced here because of limited space. For
further details see [4].

4.2.1 New Modeling Method

As has been discussed, problem solving is composed of a number of stages, and
currently computers take part in only a portion of this process. The objective of
making computer-led interactive systems is to move most of these stages to the
computer. Since a problem originates on the person’s side, he/she has to represent
it explicitly. This problem representation has to be accepted by the computer
immediately. It is desirable that the computer helps the person to represent
his/her problem and does everything afterward until a solution is obtained.

Thus the difference between the conventional style and the new style of problem
solving lies in the point at which a problem is transferred from human to computer
and, accordingly, in its representation. In the old style the problem is represented
in the form of a program. In the new style problems must be represented in a form
close to that in which they are generated in human brains, be transferred to
computers as soon as they are created, and be processed there. The formal
representation of the problem is called here a problem model.

A problem is a concept created in a person’s brain, and a model is its explicit
representation. The principle of problem modeling is to represent explicitly
everything that relates to problem solving, so that no information remains in a human
brain in a form invisible to others.

A method of problem model representation must be decided such that a variety of
problems can be represented in the same framework. Here lies a big issue of ontology
[5, 6]. As will be discussed in the next section, the problem model is represented
as a compound of predicates and the structures of the conceptual entities that are
also related to the language. To assure a common understanding of the model by many
people, the structuring rules must first of all be standardized so that people come
to the same understanding of the meaning of a structure. It is also necessary that
people have a common understanding of the same language expressions [7]. In this
paper it is assumed that these conditions are met, and ontology is not discussed in
further detail.

A problem model is a formal representation of the user’s problem. It must be
comprehensible to persons because it is created and represented by a person. It
must also be comprehensible to computers because it is to be manipulated by
computers. It plays a key role in this new information technology. It is assumed in
this paper that every problem is created in relation to some object in which a user
has an interest. If the representation of an object is not complete but some part is
missing, then a problem arises there. Therefore to represent a problem is to
represent the object with some part missing, and to solve the problem is to fill in
the missing part. This representation is named an object model. The basis of the
object model is to represent the relation between the structure of the object, which
is constructed from a set of components, and the functionality of every conceptual
object (the object itself and its components). This is the definition of an object
model in the problem representation of this paper.

Actually, only limited aspects of an object, those within the scope of the user’s
interest, can be represented. Different problems can then arise from the same object
depending on the different views of the users. This means that a representation not
only of an object but also of the person’s view of the object must be included in
the problem model in order to represent the problem correctly.

A person may have an interest in everything in the world, even in another person’s
activity. The latter person, in whom the first person is interested, may in turn
have an interest in yet another person. This implies that, if the persons are
represented explicitly as subjects, a problem model forms a nested structure of the
subjects’ interests. This is illustrated in Figure 4. For example, a problem of
making programs needs this scheme. A program is a special type of automatic problem
solver, and three subjects at different levels are involved in defining automatic
programming. Subject S1 is responsible for executing a task in, for example, a
business. Subject S2 makes a program for the task of S1. For this purpose subject
S2 observes S1’s activity and makes its model as an object of interest before
programming. Subject S3 observes subject S2’s activity as an object and automates
S2’s programming task. The activities of these subjects can be represented by
predicates such as processTrans(S1, Task), makeProgram(S2, processTrans(S1, Task),
Program) and automateActivity(S3, makeProgram(S2, processTrans(S1, Task), Program),
System), respectively. The upper activities need higher-order predicates. This kind
of stratified objects/activities is called the Multi-Strata Structure of Objects and
Activities. A model of Multi-Strata Activities is called a Multi-Strata Model
[8, 9].
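
The nesting of the three predicates can be made concrete with a small sketch. The predicate names and arguments are taken from the text; the Term class and the strata function are hypothetical helpers introduced only to show how an activity can appear as an argument of a higher activity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A predicate with arguments; an argument may itself be a Term."""
    name: str
    args: tuple = ()

    def __str__(self) -> str:
        if not self.args:
            return self.name
        return f"{self.name}({', '.join(str(a) for a in self.args)})"


def strata(term) -> int:
    """Count the nested activity levels (0 for a plain symbol)."""
    if not isinstance(term, Term) or not term.args:
        return 0
    return 1 + max(strata(a) for a in term.args)


# The three activities from the text, each wrapping the one below it.
s1 = Term("processTrans", ("S1", "Task"))
s2 = Term("makeProgram", ("S2", s1, "Program"))
s3 = Term("automateActivity", ("S3", s2, "System"))
```

Here strata(s3) evaluates to 3, one level per subject, which mirrors the three strata of the automatic programming example.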

A multi-strata model is composed of three different sub-models: the Pure Object
Model, the User Subject Model and the Non-User Subject Model. In the above example,
S3 is the user who intends to solve the problem of making an automatic programming
system. Seen from the user, subjects S2 and S1 are within the objects being
considered by subjects S3 and S2 respectively and are, therefore, non-user
subjects. Subject S1 performs a task as work on a pure object.

These sub-models are related as follows (Figure 4).

Problem Model = Subject Model + Pure Object Model

Subject Model = User Subject Model + Non-User Subject Model

A problem model changes its state during problem solving. The model is classified
into three classes according to the progress of problem solving: an incipient model,
a sufficient model and a satisfactory model. An incipient model is the starting
model with the least information, given by the user. The satisfactory model is a
model that is provided with all the required information, i.e. the solution.

The sufficient model is an intermediate-state model. Starting from the incipient
model, information is added to the model until it reaches the state with sufficient
information for deriving the satisfactory model.

It is desirable that the system collects as much information as possible from
inside and outside information sources in order to reduce the burden on the user of
building a sufficient model containing a large amount of information. The system is
required to gather this information using the information in the incipient model as
a clue, so that the incipient model, and therefore the user’s effort, can be made as
small as possible. After the model has been given enough information, the user can
modify it incrementally to make it a sufficient model. The sufficient model made in
this way, however, is not yet a correct model; it is the model from which problem
solving in the narrow sense starts in order to make it the satisfactory model.







Fig. 4. Multi-strata object and model

Incipient Model    Sufficient Model            Satisfactory Model
Human builds       Computer adds information   Computer fills the blank
                   Human aids                  Human makes decision

The major inside information source is a case base. It must be well structured so
that the system can build a sufficient model as close as possible to what the user
wishes. The outside information sources are on the web, and the system must be
provided with a search engine. At present, however, it is not easy to use outside
information for autonomous problem solving. A new technology, including ontology,
must be developed.

Various types of problems are defined depending on the part missing from a model.
For example, if some functionality of an entity in a pure object model is missing,
an analytic problem is generated. If some functionality is given as the requirement
but the structure of entities in a pure object model is missing, then a design-type
problem arises. If the structure of activities is missing, then a scheduling-type
problem arises.
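
The way the missing part determines the problem type can be sketched as follows. The dictionary keys, the use of None for a missing part and the order in which the parts are checked are all assumptions made for illustration; the text does not say how a model with several missing parts is classified.

```python
def problem_type(model: dict) -> str:
    """Name the problem type from the part of the model that is missing (None)."""
    if model.get("functionality") is None:
        return "analysis"      # functionality of an entity is missing
    if model.get("object_structure") is None:
        return "design"        # functionality is required, structure is missing
    if model.get("activity_structure") is None:
        return "scheduling"    # structure of activities is missing
    return "none"              # nothing missing: a satisfactory model


# Hypothetical model: a requirement is given but the structure is unknown.
wing = {
    "functionality": "lift >= required_lift",
    "object_structure": None,
    "activity_structure": ["analyze", "modify"],
}
kind = problem_type(wing)
```

In this sketch kind is "design", matching the case where functionality is given as a requirement while the structure of entities is absent.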

By representing problems explicitly in this form, information is made visible, and
the objectivity and communication problems are thereby resolved.

4.2.2 Aiding Human Externalization

Externalization is the computer’s support for (1) the human cognitive process, to
help the user clarify his/her idea, and (2) model building, in order to represent
the idea in the form of a model. It is a human-computer interface in a much broader
sense than the ordinary ones.

Very often novice users are not accustomed to representing their ideas formally.
Sometimes their ideas are quite nebulous, and they cannot express them in correct
sentences. How can the system help these users? This issue belongs to cognitive
science. What the system can do is to stimulate the users to notice what they
intend. Some research has been done on this issue [10, 11], but it is not discussed
further in this paper. In the following it is assumed that the users have clear
ideas about their problems. The system aids them in building models to represent
these ideas. Easy problem specification, computer-led modeling and shared modeling
are discussed.

(1) Easy problem specification

The model depends on the type of problem the user wants to solve. The
problem-solving structure for the problem type must be made ready before going into
problem solving. In general this is not easy for novice users. The system therefore
allows the user simply to say what he/she would like to do, and the computer
provides the problem-solving structure.

Example: Suppose the user wants to do a design task in a specific domain, say
#domain-j. The user selects Design as the type of problem and #domain-j as the
domain. This is formalized as designObject(userName, Object, #domain-j). Then the
system uses the following rule to generate the problem-solving structure for design.

designObject(HumanSubject, Object, Domain) :-
    designManagement(Computer, designObject(HumanSubject, Object, Domain))

designManagement represents the design-type problem-solving structure. It is
further expanded into the set of rules that represent the structure.
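
The effect of this rule, wrapping the user's request in the management predicate for its problem type, can be imitated with a short sketch. The tuple encoding of predicates and the STRUCTURES registry are hypothetical; only the predicate names designObject and designManagement come from the text.

```python
# Hypothetical registry: problem-type predicate -> management predicate that
# stands for the problem-solving structure of that type.
STRUCTURES = {"designObject": "designManagement"}


def expand(request: tuple) -> tuple:
    """Wrap a user request in the management predicate for its problem type."""
    head = request[0]
    manager = STRUCTURES[head]  # raises KeyError if no structure is registered
    return (manager, "Computer", request)


# designObject(userName, Object, #domain-j) encoded as a tuple.
request = ("designObject", "userName", "Object", "#domain-j")
expanded = expand(request)
```

A more specific type such as designMaterial would, under the same assumption, simply be another entry in the registry.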

In reality, it is possible to define a more specific design-type problem such as
designMaterial if the latter has a specific problem-solving structure for designing
materials. This structure is specialized to design in a specific domain, materials
in this case. In this way problem types are classified as finely as possible and
form a hierarchy. This hierarchy of problem types is presented to the user, who
selects one or more types. The problem domain is also divided into sets of narrower
domains, and these domains form a hierarchy.

(2) Computer-led modeling and system building 

In building a sufficient model the computer adds information to the model as much as 
possible. The main source of information in this case is the case base. In some special 
cases, special knowledge is used in order to make the model sufficient [4]. 

(3) Allowing persons to share model building

The model of any large-scale problem is also large. It is difficult for one person
to build it alone, so model building has to be shared by many people. In Figure 5
the user represents his/her intention in the form of the user model. At the same
time he/she can make the other part of the problem model that he/she wants to
decide personally. Figure 5 represents an example of modeling an enterprise. In
addition to the user model, the user specifies the part of the model enclosed by the
thick line, including the subjects with their activities. An externalization
subsystem is provided for every subject to which a human is assigned. After that,
control moves downward to the lower nodes, and the human subjects behave as users
successively. That a human subject behaves as a user means that the externalization
subsystem begins to work for that subject to aid the human in building the still
lower part of the model. But the activity of every lower human subject has already
been decided by an upper human subject and is given to the subject. The subject is
obliged to work to achieve the activity.




Fig. 5. An illustration of problem model

In this way the subjects SubjectA, SubjectB, SubjectC and so on in Figure 5
behave as users one by one and extend the model downward. As will be described
in 4.2.4, model building is closely related to object decomposition, and this
decomposition depends on the activity of the related subject. For example, let the
subject be given a design task as the required activity. The design-type
problem-solving structure is prepared. According to this structure the object model
is built downward from the top. Following this object model, a subject model is
formed (refer to 4.2.4).

4.2.3 Large Knowledge Base Covering Wide Application Areas and Problem 
Solving System Generation 

To deal with a multi-disciplinary problem automatically, the system must be
provided with multi-domain knowledge. From the practical point of view, using the
large knowledge base directly is inefficient because it contains a lot of
irrelevant knowledge. A specific problem uses only the specific knowledge that
concerns the related domains. A method of dynamically extracting from a large
knowledge base only the knowledge relevant to the given problem must be developed
[12]. This means that the real problem-solving system has to be generated
automatically for every specific problem. Thus the new information system is a
multi-level system composed of one system that actually solves a problem and another
that generates this actual problem-solving system. The large knowledge base must be
suitably organized in order to facilitate this knowledge extraction and system
generation. A problem-specific problem-solving system is generated based on the
problem-solving structure, as shown in Figure 6.
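
One simple way to picture the extraction step is to tag each piece of knowledge with its domain and keep only the entries whose domain is named in the problem. This is a deliberately naive stand-in, not the method of [12]: the domain tags, the entry format and the rule strings are all invented for the example.

```python
def extract_relevant(knowledge_base, domains):
    """Return only the entries tagged with one of the problem's domains."""
    wanted = set(domains)
    return [entry for entry in knowledge_base if entry["domain"] in wanted]


# Hypothetical large knowledge base covering several domains.
knowledge_base = [
    {"domain": "material", "rule": "strength(X) :- alloy(X)"},
    {"domain": "aerodynamics", "rule": "lift(W) :- area(W)"},
    {"domain": "finance", "rule": "cost(P) :- price(P)"},
]

# Knowledge handed to the problem-specific solver for a wing design problem.
working_set = extract_relevant(knowledge_base, ["material", "aerodynamics"])
```

The generated problem-solving system would then work only on working_set, so that the irrelevant finance entry never enters the search.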



4.2.4 Autonomous Problem Decomposition and Solving 

Autonomous problem solving is the key concept of the system. In principle, it is
the execution of the problem-solving structure for the given type of problem. The
details are omitted here; refer to [12, 4]. If the problem is large, it is decomposed
into a set of smaller problems before or together with the execution of the
problem-solving structure. This problem decomposition and sharing is also achieved
semi-automatically based on the problem model.

In many cases models are built top-down. An object is decomposed from the top.
The way of decomposing the object depends on the problem type, the problem domain,
and the extent to which the problem area is matured. For example, if the problem is
to design an airplane, a commonly used decomposition rule for an airplane is
Aircraft {Engine(s), Main-wing, Control-surfaces, Fuselage, Landing-gear,
Wire-harness, Electronic-System, ...}.

Using this rule the system can design an aircraft as composed of these assemblies.
After that, the human can change this structure according to his/her idea. He/she
also decides the functional requirements for every assembly. This stage is called
conceptual design. Within the scope of this object structure, the design is
evaluated as to whether the design requirement for the aircraft can be satisfied
with this structure and the functional requirements for these assemblies. After
that, the design work moves downward to each assembly. The functional requirement
decided above for each assembly becomes the design requirement for that assembly,
and the assembly is decomposed further.




Corresponding to the object decomposition, a new subject is assigned to each new
assembly, resulting in a new subject structure. The same activity as that of the
parent subject, i.e. ‘designObject’ in this case, is given to every new subject. The
pair of a new object and a new subject created in this way forms an agent in the
system. That is, this problem decomposition creates a new multi-agent system. Each
agent takes a share of the design.

Object Model Decomposition:      Object  {Object1,  Object2,  ..., ObjectN}
Object-Subject Correspondence:               |         |             |
Subject Model Formation:         Subject {Subject1, Subject2, ..., SubjectN}
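
The correspondence above, in which each decomposed object receives its own subject and inherits the parent's activity, can be sketched as a recursive procedure. The Agent class, the subject naming scheme and the sub-assemblies of the main wing are hypothetical, and the sketch assumes that the decomposition rules contain no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """A pair of an object and the subject assigned to it."""
    obj: str
    subject: str
    activity: str
    children: list = field(default_factory=list)


def decompose(agent: Agent, rules: dict) -> Agent:
    """Split the agent's object by its rule and assign a subject to each part."""
    for i, part in enumerate(rules.get(agent.obj, []), start=1):
        child = Agent(obj=part,
                      subject=f"{agent.subject}.{i}",
                      activity=agent.activity)  # same activity as the parent
        agent.children.append(child)
        decompose(child, rules)
    return agent


# Abbreviated aircraft rule from the text plus an invented wing rule.
rules = {
    "Aircraft": ["Engine", "Main-wing", "Fuselage"],
    "Main-wing": ["Spar", "Rib"],
}
top = decompose(Agent("Aircraft", "Subject", "designObject"), rules)
```

Each Agent instance corresponds to one member of the resulting multi-agent system, with the tree of children reflecting both the object model and the subject model.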

The way of decomposing the object model differs according to the problem. In a
design-type problem the object model is built in such a way that the required
functionality of the object can be satisfied, and a new person is assigned to each
new object. In other cases the user at the top specifies another activity. For
example, the user gives the top subject, as its activity (instead of
‘designObject’), an evolutionary rule for decomposing an object automatically. This
rule decomposes an object so as to adapt to the environment given by the user. An
evolutionary system that adapts to the environment is then created.

4.2.5 Generation of Program 

Among the various kinds of time problems, the requirement for rapid software
development has recently become serious. It means that the conventional style of
system development by human programmers cannot be continued any longer, and
automation or, at least, semi-automation is necessary. Considering that the variety
of information-processing requirements will increase in the near future, not only
procedural programs but also more general classes of information-processing methods,
e.g. neural networks, have to be developed in computers semi-automatically. There
are two approaches to system development. One is to generate a new program from
scratch to meet the given condition, and the other is to integrate existing
programs into a program with a larger input scope.

Every procedural program is a representation, in a programming language, of a
compound of subject activities. This compound is represented as a structure of the
related activities. Programming is to specify a structure of activities and then to
translate it into a program. The object to be programmed, i.e. the structure of the
activities, is obtained by an exploratory operation that finds a route from the
given problem to the goal. That is, programming is a process that follows normal
problem solving.

The automatic programming proceeds as follows (refer to Figure 5).

(1) To collect the activities in the problem model that are to be included in the
compound activity to represent a program.

(2) To make the meaningful connection between activities where the meaningful 
connection is to connect one’s output port to the other’s input port so that the 
resulting compound is meaningful. This connection is made locally between a 
pair of activities. It may have already been made before and included in a case or 
the user makes a new one. 

(3) To define inputs of some activities and outputs of some activities as the global 
inputs from the outside and the global output ports to the outside respectively. 

(4) To explore a route from the global inputs to the global outputs for the specific 
instance values substituted to the global input variables by inference resulting in a 
deduction tree. 

(5) To generalize the deduction tree restoring the variables from the instance values 
resulting in the generalized deduction tree including necessary program structures 
for the target program. 

(6) To translate the deduction tree into another structure of functions according to the
program-structuring rules such as sequence, branch, loop, etc.

(7) To translate the structure of functions into a specific program code. 

Auxiliary stage; As an auxiliary procedure for the special case such as being 
required to meet the real time operation, the following stage is provided. 

(8) To develop a simulator for estimating the execution time and to make a problem 
solving structure for real-time programming type problem to include this 
simulator. 
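
Steps (2) to (4) above, connecting output ports to input ports and exploring a route from the global inputs to the global outputs, can be illustrated with a simplified sketch. It replaces the inference that produces a deduction tree with plain forward chaining over port names, so the result is a flat sequence of activities and may contain activities that are not strictly needed. The activity and port names are invented.

```python
def find_route(activities, global_inputs, global_output):
    """Return an ordered list of activity names that produces global_output.

    activities maps a name to (input_ports, output_port). An activity can
    fire once all of its input ports are available. Returns None when the
    global output cannot be reached.
    """
    available = set(global_inputs)
    route = []
    progress = True
    while global_output not in available and progress:
        progress = False
        for name, (inputs, output) in activities.items():
            if name not in route and set(inputs) <= available:
                route.append(name)
                available.add(output)
                progress = True
    return route if global_output in available else None


# Hypothetical activities collected from a problem model (step 1).
activities = {
    "readOrder": (["order"], "items"),
    "priceItems": (["items"], "total"),
    "printInvoice": (["total"], "invoice"),
}
route = find_route(activities, ["order"], "invoice")
```

The resulting route plays the role of the structure that steps (5) to (7) would generalize and translate into program code.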

Operation (8) is essential in the case of embedded software. Embedded software, for
example the software embedded in a vehicle engine, is becoming large, and its role is
increasing. Special attention must be paid to these embedded systems because of
their peculiar development conditions [13].



4.2.6 Integration of Different Information Processing Methods 

Integration is to merge two or more information processing units (IPUs hereafter) of
different styles, such as a procedural program, a knowledge-based processing unit,
a neural network and so on, into an IPU with a larger scope. It needs a special
method of transformation between different representation schemes. Finding such a
transformation as a general method is not easy, and in many cases it has been
performed manually in an ad hoc way for each specific pair of IPUs to be merged.
This paper aims at its automation. For this purpose the integration problem is
considered in two steps.

In the first step, integration is defined as an operation in which one IPU takes in
other IPU(s) and uses their functions in order to realize a new function as a whole. The
former is called the master IPU and the latter the slave IPU. In order to combine
the slave IPU into its operation, the master IPU must be able to represent the features of
the slave IPU in its own representation scheme. The master is also required to
transform the format of some of its data into the input of the slave, and the
output of the slave back into its own format, in such a way that the slave can be
activated properly from within the master IPU. These representations are added to the
master IPU from outside. Thus the master IPU has to be able to accept this additional
information and expand its scope without disturbing its own processing. The
expanded master IPU can decide when to invoke the slave and translate the data to
send to it. After receiving the output from the slave and translating it into the
master scheme, the master IPU can continue its operation.

There are a number of information-processing styles, and for some of them it is
difficult to meet this condition for being the master IPU. As typical examples, neural
networks, procedural processing and declarative processing are considered. A neural
network has a fixed processing mechanism, and it is difficult for it to accept
information on the slave IPU from outside without changing its whole structure.
Procedural processing should be more flexible than a neural network, but in reality it
is still difficult to add such information from outside without changing the
main body of the program. That is, the time and way of invoking the slave and the
transformation routines must be programmed and embedded in the main program in
advance. This amounts to rebuilding the program, so automating integration in a
changing environment is difficult. Finally, in declarative processing, for example a
rule-based system using 'If-Then' rules, adding information from outside does not
require modification of the main part of the system; the system expands its
scope of processing by combining the added rules with the existing ones. This is an
important characteristic of declarative processing systems. It makes declarative
processing potentially the best candidate for the master IPU, since it can integrate
various IPUs of different styles. The descriptive information on the slave to be added to
the master IPU must exist at integration time and is added to the knowledge base of the
declarative processing system. This information is assumed to have been created when
the slave IPU was first developed.

In the second step this idea is generalized. Let two IPUs, say (Ia, Ma) and (Ib, Mb),
of arbitrary information processing styles Sa and Sb respectively, be merged. If at
least one of them is a declarative IPU, then the integration discussed above is
possible by making this IPU the master. If neither IPUa nor IPUb is declarative,
their direct integration is difficult. Their integration is still possible, however, by
introducing a new declarative IPU. In this case this third IPU is the master of both
IPUa and IPUb, and the representations of both are added to it.
Thus non-declarative IPUs can be integrated indirectly via the third declarative IPU. It
is possible to prepare such a declarative IPU for the purpose of integrating arbitrary
IPUs.
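
The master/slave arrangement described above can be illustrated with a small sketch. It is not the author's system: the names `SlaveDescriptor` and `DeclarativeMaster`, and the two toy slaves, are hypothetical and chosen only for illustration. The sketch assumes that the descriptive information on a slave consists of a condition for invoking it and two translation functions, and it shows a declarative master acting as the third IPU that connects two non-declarative ones.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class SlaveDescriptor:
    """Hypothetical description of a slave IPU, added to the master from outside."""
    name: str
    applies: Callable[[Dict[str, Any]], bool]    # "If" part: when to invoke the slave
    to_slave: Callable[[Dict[str, Any]], Any]    # master facts -> slave input
    run: Callable[[Any], Any]                    # the slave IPU itself
    to_master: Callable[[Any], Dict[str, Any]]   # slave output -> master facts


class DeclarativeMaster:
    """Toy rule-based master: adding a descriptor never changes this class."""

    def __init__(self) -> None:
        self.descriptors: List[SlaveDescriptor] = []

    def add(self, descriptor: SlaveDescriptor) -> None:
        self.descriptors.append(descriptor)

    def solve(self, facts: Dict[str, Any]) -> Dict[str, Any]:
        facts = dict(facts)
        fired = set()
        progress = True
        while progress:                  # keep firing until no descriptor applies
            progress = False
            for d in self.descriptors:
                if d.name not in fired and d.applies(facts):
                    facts.update(d.to_master(d.run(d.to_slave(facts))))
                    fired.add(d.name)
                    progress = True
        return facts


def mean(values):
    """Stand-in for a procedural slave IPU."""
    return sum(values) / len(values)


master = DeclarativeMaster()
master.add(SlaveDescriptor(
    name="mean",
    applies=lambda f: "samples" in f and "average" not in f,
    to_slave=lambda f: list(f["samples"]),
    run=mean,
    to_master=lambda out: {"average": out},
))
master.add(SlaveDescriptor(
    name="classifier",                   # stand-in for, e.g., a trained network
    applies=lambda f: "average" in f,
    to_slave=lambda f: f["average"],
    run=lambda x: "high" if x > 10 else "low",
    to_master=lambda out: {"label": out},
))

result = master.solve({"samples": [8, 12, 16]})
print(result)
```

Neither slave knows about the other; the output of the first reaches the second only through the facts held by the master, which mirrors the indirect integration via a third declarative IPU.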

4.2.7 Acquiring Knowledge 

Issues 4.2.1 through 4.2.6 concern the most fundamental parts of the system for
achieving autonomy. Therefore these parts must be developed in a relatively short
time. But the system assumes the existence of a large amount of knowledge to be
used for solving various problems, so a method for providing the system with this
knowledge is necessary. The amount and quality of knowledge contained in the
system will finally decide the value of the intelligent system. Generally speaking,
however, this part of the technology is not yet well developed. Although research interest
in methods for creating knowledge from various information sources has been
increasing recently and new research groups are being formed, the technology is not yet
established and remains a long-term goal.

The major sources of knowledge are as follows. 

(1) Human knowledge: Humans are the major source of knowledge. Interviewing
experts is still the main method of acquiring human knowledge. In addition, it is
desirable that the system acquire knowledge produced by users while they use the
system for problem solving.

(2) Text: Published or recorded texts are also important knowledge sources. There
are two kinds of texts: edited text and unedited text.

(i) Edited text: Some texts have been edited and compiled as knowledge
for the purpose of making a hard-copy knowledge base, such as
handbooks in many technical fields, dictionaries, etc.

(ii) Unedited text: Many texts are generated and recorded in
daily activities, such as claims from clients about industrial products,
pilots' reports after every flight, including near-miss experiences, etc. These
texts are accumulated but hardly used, because it takes many man-hours to
read them and use them afterward. But the information included therein is quite
important for correcting problems that had gone unnoticed. A method to use
this information autonomously is desirable. Today text mining and
knowledge discovery from text are being researched for this class of
texts.

(3) Data: Huge amounts of data are being accumulated in many areas of human
activity and are still increasing. Some general information is hidden behind these
data. Data mining and knowledge discovery in data (KDD) are efforts to find
this hidden useful information.

(4) Program: Many programs have been developed so far. Every program was
developed on the basis of knowledge about some object at the time of development. After
some years the programmer leaves and the program remains in use, but no one
understands the knowledge behind it. Extracting knowledge from
programs is becoming a very important task. The knowledge extracted from
programs can be used in many ways.

Issues (2) through (4) are inverse problems and need new technologies.
Research is being conducted in many places, and recently there have been many
international conferences. But the technologies for discovering knowledge are not yet
established; this will take more time. In this sense, these are long-term goals.



5. Conclusion 

There is concern that human capability cannot keep up with the increasing scale
and complexity of the problems that will arise in the near future, and that many troubles
may consequently occur in human society. Is there anyone or anything that can resolve
these troubles in place of human beings? The only possible answer to this question is to
make computers more intelligent, so that humans can be freed from the kinds of work
they are weak at. The author has discussed in this paper a way to make computers more
intelligent and to expand the scope of information processing. This does not mean that
computers can be more intelligent than human beings, but that computers help humans
make the best decisions by undertaking the tasks of managing the decision-making
environment. This paper analyzed the weak points of humans, found a possible
alternative method, and then proposed a way to develop new information technology
that can back up and aid human activity.

The major topics discussed in this paper were: (0) an overall software architecture
for future information systems, (1) a modeling scheme to accept and represent a wide
range of problems, (2) a method for externalizing human ideas and representing them as
a model, (3) a large knowledge base and problem solving system generation, (4)
autonomous problem decomposition and solving, (5) generation of programs, (6)
integration of IPUs of different styles, and (7) knowledge acquisition and
discovery.

A number of difficult problems are involved in this scope. Only an outline of a
part of the research project aimed at resolving these difficulties has been presented.
The research project started in 1998 under the sponsorship of The Japanese Science
and Technology Agency, and issues (0) through (6) are being developed as the
short-term target (5 years). Many problems remain unsolved. Among them,
the issue of ontology will become more serious in the future. It becomes
more important when the system gathers information spread widely across the web.
Issue (7) is included in the project but not directly in the short-term
target. It is included as basic research to be continued in subsequent work.

A new representation scheme was necessary for representing the new concepts
shown above, and a language/system MLL/KAUS (Multi-Layer Logic/Knowledge
Acquisition and Utilization System) suited for the purpose has been developed. It is
not covered in this paper; refer to [14,15].


Abstract. Rough set based data analysis starts from a data table, called an
information system. The information system contains data about objects of interest
characterized in terms of some attributes. Often we distinguish in the information
system condition and decision attributes. Such an information system is called
a decision table. The decision table describes decisions in terms of conditions that
must be satisfied in order to carry out the decision specified in the decision table.
With every decision table a set of decision rules, called a decision algorithm, can be
associated. It is shown that every decision algorithm reveals some well known
probabilistic properties, in particular it satisfies the Total Probability Theorem and the
Bayes' Theorem. These properties give a new method of drawing conclusions from
data, without referring to prior and posterior probabilities, inherently associated
with Bayesian reasoning.



1 Introduction 

Rough set based data analysis starts from a data table, called an informa- 
tion system. The information system contains data about objects of interest 
characterized in terms of some attributes. Often we distinguish in the infor- 
mation system condition and decision attributes. Such an information system 
is called a decision table. The decision table describes decisions in terms of 
conditions that must be satisfied in order to carry out the decision specified 
in the decision table. With every decision table we can associate a decision 
algorithm, which is a set of if... then... decision rules. The decision rules can
also be seen as a logical description of approximation of decisions, and
consequently a decision algorithm can be viewed as a logical description of basic
properties of the data. The decision algorithm can be simplified, which results
in an optimal description of the data, but this issue will not be discussed in this
paper.

First, basic notions of rough set theory will be introduced.
Next the notion of the decision algorithm will be defined and some of its basic
properties will be shown. It is revealed that every decision algorithm has
some well known probabilistic features, in particular it satisfies the Total 
Probability Theorem and the Bayes’ Theorem [5]. These properties give a 
new method of drawing conclusions from data, without referring to prior and 
posterior probabilities, inherently associated with Bayesian reasoning. Three 
simple tutorial examples will be given to illustrate the above discussed ideas. 
The real-life examples are much more sophisticated and will not be presented 
here. 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 30-45, 2001.

2 Approximation of Sets

The starting point of rough set based data analysis is a data set, called an
information system.

An information system is a data table, whose columns are labeled by 
attributes, rows are labeled by objects of interest and entries of the table are 
attribute values. 

Formally, by an information system we will understand a pair $S = (U, A)$,
where $U$ and $A$ are finite, nonempty sets called the universe and the set of
attributes, respectively. With every attribute $a \in A$ we associate a set $V_a$ of
its values, called the domain of $a$. Any subset $B$ of $A$ determines a binary
relation $I(B)$ on $U$, which will be called an indiscernibility relation and is
defined as follows: $(x, y) \in I(B)$ if and only if $a(x) = a(y)$ for every $a \in B$,
where $a(x)$ denotes the value of attribute $a$ for element $x$. Obviously $I(B)$ is
an equivalence relation. The family of all equivalence classes of $I(B)$, i.e., the
partition determined by $B$, will be denoted by $U/I(B)$, or simply by $U/B$;
an equivalence class of $I(B)$, i.e., a block of the partition $U/B$, containing $x$
will be denoted by $B(x)$.

If $(x, y)$ belongs to $I(B)$ we will say that $x$ and $y$ are B-indiscernible
(indiscernible with respect to $B$). Equivalence classes of the relation $I(B)$
(or blocks of the partition $U/B$) are referred to as B-elementary sets or
B-granules.

If we distinguish in an information system two disjoint classes of
attributes, called condition and decision attributes, respectively, then the
system will be called a decision table and will be denoted by $S = (U, C, D)$, where
$C$ and $D$ are disjoint sets of condition and decision attributes, respectively.

Suppose we are given an information system $S = (U, A)$, $X \subseteq U$, and
$B \subseteq A$. Our task is to describe the set $X$ in terms of attribute values from
$B$. To this end we define two operations assigning to every $X \subseteq U$ two sets
$B_*(X)$ and $B^*(X)$, called the B-lower and the B-upper approximation of $X$,
respectively, and defined as follows:

$$B_*(X) = \bigcup_{x \in U} \{B(x) : B(x) \subseteq X\},$$

$$B^*(X) = \bigcup_{x \in U} \{B(x) : B(x) \cap X \neq \emptyset\}.$$

Hence, the B-lower approximation of a set is the union of all B-granules that
are included in the set, whereas the B-upper approximation of a set is the
union of all B-granules that have a nonempty intersection with the set. The
set

$$BN_B(X) = B^*(X) - B_*(X)$$

will be referred to as the B-boundary region of $X$.

If the boundary region of $X$ is the empty set, i.e., $BN_B(X) = \emptyset$, then $X$
is crisp (exact) with respect to $B$; in the opposite case, i.e., if $BN_B(X) \neq \emptyset$,
$X$ is referred to as rough (inexact) with respect to $B$.
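
As a minimal sketch of these definitions, the following Python fragment builds the partition U/B and the lower approximation, upper approximation and boundary region of a set. The toy information system and the function names are hypothetical and serve only as an illustration; they are not taken from the paper.

```python
from collections import defaultdict


def partition(universe, table, attrs):
    """Return the blocks of U/B: objects grouped by their values on attrs."""
    blocks = defaultdict(set)
    for x in universe:
        key = tuple(table[x][a] for a in attrs)
        blocks[key].add(x)
    return list(blocks.values())


def approximations(universe, table, attrs, target):
    """Return (lower, upper, boundary) of target with respect to attrs."""
    lower, upper = set(), set()
    for block in partition(universe, table, attrs):
        if block <= target:      # granule included in the set
            lower |= block
        if block & target:       # granule intersects the set
            upper |= block
    return lower, upper, upper - lower


# Hypothetical toy information system (illustration only).
table = {
    1: {"Height": "tall", "Hair": "blond"},
    2: {"Height": "tall", "Hair": "blond"},
    3: {"Height": "medium", "Hair": "dark"},
    4: {"Height": "medium", "Hair": "dark"},
    5: {"Height": "short", "Hair": "red"},
}
universe = set(table)
X = {1, 2, 3}

lower, upper, boundary = approximations(universe, table, ["Height", "Hair"], X)
print(lower, upper, boundary)
```

Objects 3 and 4 are indiscernible but only object 3 belongs to X, so the boundary region is nonempty and X is rough with respect to the chosen attributes.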



3 Decision Rules 



In this section we will introduce a formal language to describe approximations 
in logical terms. 

Let $S = (U, A)$ be an information system. With every $B \subseteq A$ we associate
a formal language, i.e., a set of formulas $For(B)$. Formulas of $For(B)$ are
built up from attribute-value pairs $(a, v)$, where $a \in B$ and $v \in V_a$, by means
of the logical connectives $\wedge$ (and), $\vee$ (or), $\sim$ (not) in the standard way.

For any $\Phi \in For(B)$, by $\|\Phi\|_S$ we denote the set of all objects $x \in U$
satisfying $\Phi$ in $S$ and refer to it as the meaning of $\Phi$ in $S$.

The meaning $\|\Phi\|_S$ of $\Phi$ in $S$ is defined inductively as follows:

$\|(a, v)\|_S = \{x \in U : a(x) = v\}$ for all $a \in B$ and $v \in V_a$,
$\|\Phi \vee \Psi\|_S = \|\Phi\|_S \cup \|\Psi\|_S$,
$\|\Phi \wedge \Psi\|_S = \|\Phi\|_S \cap \|\Psi\|_S$,
$\|\sim\Phi\|_S = U - \|\Phi\|_S$.

A formula $\Phi$ is true in $S$ if $\|\Phi\|_S = U$.

A decision rule in $S$ is an expression $\Phi \to \Psi$, read "if $\Phi$ then $\Psi$", where
$\Phi \in For(C)$, $\Psi \in For(D)$ and $C$, $D$ are condition and decision attributes,
respectively; $\Phi$ and $\Psi$ are referred to as conditions and decisions of the rule,
respectively.

A decision rule $\Phi \to \Psi$ is true in $S$ if $\|\Phi\|_S \subseteq \|\Psi\|_S$.

The number $supp_S(\Phi, \Psi) = card(\|\Phi \wedge \Psi\|_S)$ will be called the support
of the rule $\Phi \to \Psi$ in $S$. We consider a probability distribution
$p_U(x) = 1/card(U)$ for $x \in U$, where $U$ is the (non-empty) universe of objects of $S$; we
have $p_U(X) = card(X)/card(U)$ for $X \subseteq U$. With any formula $\Phi$ we associate
its probability in $S$ defined by

$$\pi_S(\Phi) = p_U(\|\Phi\|_S).$$

With every decision rule $\Phi \to \Psi$ we associate a conditional probability

$$\pi_S(\Psi|\Phi) = p_U(\|\Psi\|_S \mid \|\Phi\|_S)$$

that $\Psi$ is true in $S$ given $\Phi$ is true in $S$, called the certainty factor, used first
by Łukasiewicz [3] to estimate the probability of implications. We have

$$\pi_S(\Psi|\Phi) = \frac{card(\|\Phi \wedge \Psi\|_S)}{card(\|\Phi\|_S)},$$

where $\|\Phi\|_S \neq \emptyset$.

This coefficient is now widely used in data mining and is called the confidence
coefficient.

Obviously, $\pi_S(\Psi|\Phi) = 1$ if and only if $\Phi \to \Psi$ is true in $S$.

If $\pi_S(\Psi|\Phi) = 1$, then $\Phi \to \Psi$ will be called a certain decision rule; if
$0 < \pi_S(\Psi|\Phi) < 1$ the decision rule will be referred to as an uncertain decision
rule.

Besides, we will also use a coverage factor (used e.g. by Tsumoto [14] for
estimation of the quality of decision rules) defined by

$$\pi_S(\Phi|\Psi) = p_U(\|\Phi\|_S \mid \|\Psi\|_S),$$

which is the conditional probability that $\Phi$ is true in $S$, given $\Psi$ is true in $S$
with the probability $\pi_S(\Psi)$. Obviously we have

$$\pi_S(\Phi|\Psi) = \frac{card(\|\Phi \wedge \Psi\|_S)}{card(\|\Psi\|_S)}.$$

The certainty factor in $S$ can also be interpreted as the frequency of
objects having the property $\Psi$ in the set of objects having the property $\Phi$,
and the coverage factor as the frequency of objects having the property $\Phi$
in the set of objects having the property $\Psi$.

The number

$$\sigma_S(\Phi, \Psi) = \frac{supp_S(\Phi, \Psi)}{card(U)} = \pi_S(\Psi|\Phi) \cdot \pi_S(\Phi)$$

will be called the strength of the decision rule $\Phi \to \Psi$ in $S$.
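
The three coefficients can be computed directly from rule supports. The sketch below is only an illustration: the function name `rule_factors` is hypothetical, and it assumes that every object of the universe is counted under exactly one (condition, decision) pair, so that the cardinalities of the condition and decision sets are obtained by summing supports. The counts are those of the eye and hair colour data in Example 1 of this paper.

```python
def rule_factors(supports):
    """supports maps (condition, decision) to the support of that rule.

    Assumes each object is counted under exactly one rule and that every
    condition and every decision has a nonzero total count.
    """
    total = sum(supports.values())                       # card(U)
    cond_total, dec_total = {}, {}
    for (cond, dec), n in supports.items():
        cond_total[cond] = cond_total.get(cond, 0) + n   # card of condition set
        dec_total[dec] = dec_total.get(dec, 0) + n       # card of decision set
    factors = {}
    for (cond, dec), n in supports.items():
        factors[(cond, dec)] = {
            "certainty": n / cond_total[cond],
            "coverage": n / dec_total[dec],
            "strength": n / total,
        }
    return factors


# Supports from Example 1 (Eyes -> Hair).
supports = {
    ("blue", "blond"): 16,
    ("blue", "dark"): 0,
    ("hazel", "blond"): 8,
    ("hazel", "dark"): 56,
}
factors = rule_factors(supports)
for rule, f in factors.items():
    print(rule, {k: round(v, 3) for k, v in f.items()})
```

For each rule the product of its certainty and the probability of its condition equals its strength, which is the identity stated in the definition of strength above.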



4 Decision Algorithms 

In this section we define the notion of a decision algorithm, which is a logical 
counterpart of a decision table. 

Let $Dec(S) = \{\Phi_i \to \Psi_i\}_{i=1}^{m}$, $m \geq 2$, be a set of decision rules in a decision
table $S = (U, C, D)$.

1) If for every $\Phi \to \Psi$, $\Phi' \to \Psi' \in Dec(S)$ we have $\Phi = \Phi'$ or
$\|\Phi \wedge \Phi'\|_S = \emptyset$, and $\Psi = \Psi'$ or $\|\Psi \wedge \Psi'\|_S = \emptyset$, then we will
say that $Dec(S)$ is the set of pairwise mutually exclusive (independent) decision
rules in $S$.

2) If $\|\bigvee_{i=1}^{m} \Phi_i\|_S = U$ and $\|\bigvee_{i=1}^{m} \Psi_i\|_S = U$ we will say that the set of
decision rules $Dec(S)$ covers $U$.

3) If $\Phi \to \Psi \in Dec(S)$ and $supp_S(\Phi, \Psi) \neq 0$ we will say that the decision
rule $\Phi \to \Psi$ is admissible in $S$.

4) If $\bigcup_{X \in U/D} C_*(X) = \|\bigvee_{\Phi \to \Psi \in Dec^+(S)} \Phi\|_S$, where $Dec^+(S)$ is the set of all
certain decision rules from $Dec(S)$, we will say that the set of decision
rules $Dec(S)$ preserves the consistency of the decision table $S = (U, C, D)$.

The set of decision rules $Dec(S)$ that satisfies 1), 2), 3) and 4), i.e., is
independent, covers $U$, preserves the consistency of $S$, and all of whose decision rules
$\Phi \to \Psi \in Dec(S)$ are admissible in $S$, will be called a decision algorithm in $S$.

Hence, if $Dec(S)$ is a decision algorithm in $S$ then the conditions of rules
from $Dec(S)$ define in $S$ a partition of $U$. Moreover, the positive region of $D$
with respect to $C$, i.e., the set

$$\bigcup_{X \in U/D} C_*(X)$$

is partitioned by the conditions of some of these rules, which are certain in $S$.

If $\Phi \to \Psi$ is a decision rule then the decision rule $\Psi \to \Phi$ will be called an
inverse decision rule of $\Phi \to \Psi$.

Let $Dec^*(S)$ denote the set of all inverse decision rules of $Dec(S)$.

It can be shown that $Dec^*(S)$ satisfies 1), 2), 3) and 4), i.e., it is a
decision algorithm in $S$.

If $Dec(S)$ is a decision algorithm then $Dec^*(S)$ will be called an inverse
decision algorithm of $Dec(S)$.

The number

$$\eta(Dec(S)) = \sum_{\Phi \to \Psi \in Dec(S)} \max_{\Psi \in D(\Phi)} \{\sigma_S(\Phi, \Psi)\},$$

where $D(\Phi) = \{\Psi : \Phi \to \Psi \in Dec(S)\}$, will be referred to as the efficiency
of the decision algorithm $Dec(S)$ in $S$, and the sum is stretching over all
decision rules in the algorithm.

The efficiency of a decision algorithm is the probability (ratio) of all objects
of the universe that are classified to decision classes by means of decision
rules $\Phi \to \Psi$ with maximal strength $\sigma_S(\Phi, \Psi)$ among rules $\Phi \to \Psi \in Dec(S)$
with satisfied $\Phi$ on these objects. In other words, the efficiency says
how well the decision algorithm classifies objects when the decision rules with
maximal strength are used only.
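
A short sketch of the efficiency computation follows. It assumes one particular reading of the definition, namely that each distinct condition contributes the strength of its strongest rule exactly once; this reading is an assumption, adopted here because it reproduces the efficiency of 0.9 reported for both algorithms in Example 1. The function name `efficiency` is hypothetical.

```python
def efficiency(strengths):
    """strengths maps (condition, decision) to the strength of that rule.

    Assumption: each distinct condition is counted once, with the
    strength of its strongest rule.
    """
    best = {}
    for (cond, _dec), sigma in strengths.items():
        best[cond] = max(best.get(cond, 0.0), sigma)
    return sum(best.values())


# Strengths of the four rules of Example 1 (Eyes -> Hair).
strengths = {
    ("blue", "blond"): 0.2,
    ("blue", "dark"): 0.0,
    ("hazel", "blond"): 0.1,
    ("hazel", "dark"): 0.7,
}
eta = efficiency(strengths)

# The inverse algorithm swaps conditions and decisions; strengths are unchanged.
inverse = {(dec, cond): s for (cond, dec), s in strengths.items()}
eta_inverse = efficiency(inverse)

print(round(eta, 2), round(eta_inverse, 2))
```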

5 Decision algorithms and approximations 

Decision algorithms can be used as a formal language for describing approx- 
imations (see [5]). 

Let $Dec(S)$ be a decision algorithm in $S$ and let $\Phi \to \Psi \in Dec(S)$. By
$C(\Psi)$ we denote the set of all conditions of $\Psi$ in $Dec(S)$ and by $D(\Phi)$ the
set of all decisions of $\Phi$ in $Dec(S)$.

Then we have the following relationships:

a) $C_*(\|\Psi\|_S) = \| \bigvee_{\Phi' \in C(\Psi),\ \pi_S(\Psi|\Phi') = 1} \Phi' \|_S$,

b) $C^*(\|\Psi\|_S) = \| \bigvee_{\Phi' \in C(\Psi),\ 0 < \pi_S(\Psi|\Phi') \leq 1} \Phi' \|_S$,

c) $BN_C(\|\Psi\|_S) = \| \bigvee_{\Phi' \in C(\Psi),\ 0 < \pi_S(\Psi|\Phi') < 1} \Phi' \|_S$.

From the above properties we can get the following definitions:

i) If $\|\Phi\|_S = C_*(\|\Psi\|_S)$, then the formula $\Phi$ will be called the C-lower
approximation of the formula $\Psi$ and will be denoted by $C_*(\Psi)$;

ii) If $\|\Phi\|_S = C^*(\|\Psi\|_S)$, then the formula $\Phi$ will be called the C-upper
approximation of the formula $\Psi$ and will be denoted by $C^*(\Psi)$;

iii) If $\|\Phi\|_S = BN_C(\|\Psi\|_S)$, then $\Phi$ will be called the C-boundary of the
formula $\Psi$ and will be denoted by $BN_C(\Psi)$.

The above properties say that any decision $\Psi$ in $Dec(S)$ can be uniquely
described by the following certain and uncertain decision rules respectively:

$$C_*(\Psi) \to \Psi,$$

$$BN_C(\Psi) \to \Psi.$$

This property is an extension of some ideas given by Ziarko [16]. The
approximations can also be defined more generally, as proposed in [15] by Ziarko,
and consequently we obtain more general probabilistic decision rules.



6 Some properties of decision algorithms 



Decision algorithms have interesting probabilistic properties which are dis- 
cussed in this section. 

Let $Dec(S)$ be a decision algorithm and let $\Phi \to \Psi \in Dec(S)$. Then the
following properties are valid:

$$\sum_{\Phi' \in C(\Psi)} \pi_S(\Phi'|\Psi) = 1 \qquad (1)$$

$$\sum_{\Psi' \in D(\Phi)} \pi_S(\Psi'|\Phi) = 1 \qquad (2)$$

$$\pi_S(\Psi) = \sum_{\Phi' \in C(\Psi)} \pi_S(\Psi|\Phi') \cdot \pi_S(\Phi') \qquad (3)$$

$$\pi_S(\Phi|\Psi) = \frac{\pi_S(\Psi|\Phi) \cdot \pi_S(\Phi)}{\sum_{\Phi' \in C(\Psi)} \pi_S(\Psi|\Phi') \cdot \pi_S(\Phi')} \qquad (4)$$

That is, any decision algorithm, and consequently any decision table,
satisfies (1), (2), (3) and (4). Observe that (3) is the well known Total Probability
Theorem and (4) is the Bayes' Theorem. Note that we are not referring to
prior and posterior probabilities, which are fundamental in the Bayesian data analysis
philosophy. The Bayes' Theorem in our case says that: if an implication $\Phi \to \Psi$
is true in the degree $\pi_S(\Psi|\Phi)$ then the inverse implication $\Psi \to \Phi$ is true in
the degree $\pi_S(\Phi|\Psi)$.

Let us observe that the Total Probability Theorem can be presented in
the form

$$\pi_S(\Psi) = \sum_{\Phi' \in C(\Psi)} \sigma_S(\Phi', \Psi) \qquad (5)$$

and the Bayes' Theorem will assume the form

$$\pi_S(\Phi|\Psi) = \frac{\sigma_S(\Phi, \Psi)}{\pi_S(\Psi)} \qquad (6)$$

Thus in order to compute the certainty and coverage factors of decision rules
according to formula (6) it is enough to know the strength (support) of all
decision rules in the decision algorithm only. The strength of decision rules
can be computed from the data or can be a subjective assessment.

In other words, if we know the ratio of $\Phi$s in $\Psi$, thanks to the Bayes'
Theorem, we can compute the ratio of $\Psi$s in $\Phi$.
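
The following sketch illustrates how formulas (5) and (6) allow certainty and coverage factors to be obtained from rule strengths alone. The helper names are hypothetical, and the condition labels are shorthand for the rule conditions; the supports are those of Table 4 in Example 2 of this paper. The certainty is obtained from the definition of strength by dividing the strength by the probability of the condition.

```python
# Supports of the six rules of Example 2 (Table 4).
supports = {
    ("tall", "Swede"): 270,
    ("medium, dark", "German"): 90,
    ("medium, blond", "Swede"): 90,
    ("tall", "German"): 360,
    ("short", "German"): 45,
    ("medium, dark", "Swede"): 45,
}
total = sum(supports.values())
sigma = {rule: n / total for rule, n in supports.items()}   # rule strengths


def prob_decision(sigma, dec):
    """Formula (5): probability of a decision as a sum of strengths."""
    return sum(s for (_c, d), s in sigma.items() if d == dec)


def prob_condition(sigma, cond):
    """Probability of a condition, summed in the same way over decisions."""
    return sum(s for (c, _d), s in sigma.items() if c == cond)


def coverage(sigma, cond, dec):
    """Formula (6): coverage factor from strengths only."""
    return sigma[(cond, dec)] / prob_decision(sigma, dec)


def certainty(sigma, cond, dec):
    """Certainty factor, from the definition of strength."""
    return sigma[(cond, dec)] / prob_condition(sigma, cond)


cov = coverage(sigma, "tall", "Swede")
cer = certainty(sigma, "tall", "Swede")
# Bayes' Theorem (4): coverage = certainty * P(condition) / P(decision).
bayes = cer * prob_condition(sigma, "tall") / prob_decision(sigma, "Swede")
print(round(cer, 2), round(cov, 2), round(bayes, 2))
```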



7 Illustrative examples 

In this section we will illustrate the concepts introduced previously by means 
of simple tutorial examples. 

Example 1 

Let us consider Table 1 in which data on the relationships between color 
of eyes and color of hair is given. 



Table 1. Simple data table

              Hair
Eyes      blond   dark
blue        16      0
hazel        8     56


The above data can be presented as a decision table shown in Table 2.

Table 2. Decision table

Rule number   Eyes    Hair    Support
     1        blue    blond      16
     2        blue    dark        0
     3        hazel   blond       8
     4        hazel   dark       56


Assume that Hair is a decision attribute and Eyes is a condition attribute. 
The corresponding decision algorithm is given below: 

1) if (Eyes, blue) then (Hair, blond),

2) if (Eyes, blue) then (Hair, dark),

3) if (Eyes, hazel) then (Hair, blond),

4) if (Eyes, hazel) then (Hair, dark).

The certainty and coverage factors for the decision rules are given in Table 3.



Table 3. Certainty and coverage factors

Rule number   Cert.   Cov.   Support   Strength
     1        1.000   0.67      16       0.2
     2        0.000   0.00       0       0.0
     3        0.125   0.33       8       0.1
     4        0.875   1.00      56       0.7


From the certainty factors of the decision rules we can conclude that: 

- every person in the data table having blue eyes is for certain a blond, 

- for certain there are no people in the data table having blue eyes who are 
dark-haired, 

- the probability that a person having hazel eyes is a blond is 0.125, 

- the probability that a person having hazel eyes is dark-haired equals
0.875.



In other words the decision algorithm says that:

- 12.5% of persons with hazel eyes are blond,

- 87.5% of persons with hazel eyes are dark-haired,

- 100% of persons with blue eyes are blond.

From the above we can conclude that:

- people with hazel eyes are most probably dark-haired,

- people with blue eyes are for certain blond.

The efficiency of the decision algorithm is 0.9. 

The inverse decision algorithm is given below: 

1') if (Hair, blond) then (Eyes, blue),

2') if (Hair, dark) then (Eyes, blue),

3') if (Hair, blond) then (Eyes, hazel),

4') if (Hair, dark) then (Eyes, hazel).

The coverage factors say that:

- the probability that a blond has blue eyes is 0.67,

- for certain there are no dark-haired people in the data table having blue
eyes,

- the probability that a blond has hazel eyes is 0.33,

- for certain every dark-haired person in the data table has hazel eyes.

In other words: 

- 33% blond have hazel eyes, 

- 67% blond have blue eyes, 

- 100% dark-haired persons have hazel eyes. 

Thus we can conclude that: 

- blond have most probably blue eyes, 

- dark-haired people have for certain hazel eyes.

The efficiency of the inverse decision algorithm is 0.9. 

Example 2 

In Table 4 information about nine hundred people is represented. The
population is characterized by the following attributes: Height, Hair, Eyes
and Nationality.

Table 4. Characterization of nationalities

U   Height   Hair    Eyes    Nationality   Support
1   tall     blond   blue    Swede           270
2   medium   dark    hazel   German           90
3   medium   blond   blue    Swede            90
4   tall     blond   blue    German          360
5   short    red     blue    German           45
6   medium   dark    hazel   Swede            45


Suppose that Height, Hair and Eyes are condition attributes and
Nationality is the decision attribute, i.e., we want to find a description of each
nationality in terms of condition attributes.

Below a decision algorithm associated with Table 4 is given:

1) if (Height, tall) then (Nationality, Swede),

2) if (Height, medium) and (Hair, dark) then (Nationality, German),

3) if (Height, medium) and (Hair, blond) then (Nationality, Swede),

4) if (Height, tall) then (Nationality, German),

5) if (Height, short) then (Nationality, German),

6) if (Height, medium) and (Hair, dark) then (Nationality, Swede).

The certainty and coverage factors for the decision rules are shown in Table 5.



Table 5. Certainty and coverage factors

Rule number   Cert.   Cov.   Support   Strength
     1        0.43    0.67     270       0.3
     2        0.67    0.18      90       0.1
     3        1.00    0.22      90       0.1
     4        0.57    0.73     360       0.4
     5        1.00    0.09      45       0.05
     6        0.33    0.11      45       0.05


From the certainty factors of the decision rules we can conclude that:

- 43% of tall people are Swede,

- 57% of tall people are German,

- 33% of medium and dark-haired people are Swede,

- 67% of medium and dark-haired people are German,

- 100% of medium and blond people are Swede,

- 100% of short people are German.

Summing up: 

- tall people are most probably German, 

- medium and dark-haired people are most probably German, 

- medium and blond people are for certain Swede, 

- short people are for certain German. 

The efficiency of the above decision algorithm is 0.65. 

The inverse algorithm is as follows: 

1') if (Nationality, Swede) then (Height, tall),

2') if (Nationality, German) then (Height, medium) and (Hair, dark),

3') if (Nationality, Swede) then (Height, medium) and (Hair, blond),

4') if (Nationality, German) then (Height, tall),

5') if (Nationality, German) then (Height, short),

6') if (Nationality, Swede) then (Height, medium) and (Hair, dark).

From the coverage factors we get the following characterization of nationali- 
ties: 



- 11% Swede are medium and dark-haired, 

- 22% Swede are medium and blond, 

- 67% Swede are tall, 

- 9% German are short, 

- 18% German are medium and dark-haired, 

- 73% German are tall. 

Hence we conclude that: 

- Swede are most probably tall, 

- German are most probably tall. 

The efficiency of the inverse decision algorithm is 0.7. 

Observe that there are no certain decision rules in the inverse decision
algorithm; nevertheless it can properly classify 70% of the objects.

Of course it is possible to find another decision algorithm from Table 4.
Observe that there are three methods of computing the certainty and
coverage factors: either directly from the definition employing the data, or using
formula (4) or (6).

Similarly, $\pi_S(\Psi)$ can be computed in three ways: using the definition and
the data, or formula (3) or (5).

The obtained results are valid for the data only. In the case of another,
bigger data set the results may no longer be valid.

Whether they are valid or not depends on whether Table 4 is a representative
sample of a bigger population.

Example 3 

Now we will consider an example taken from [12], which will show clearly
the difference between the Bayesian and rough set approach to data analysis.

We will start from the data table presented below: 



Table 6. Voting Intentions

                  Y1
Y2   Y3      1     2     3     4
1    1      28     8     7     0
     2     153   114    53    14
     3      20    31    17     1
2    1       1     1     0     1
     2     165    86    54     6
     3      30    57    18     4


where Y1 represents Voting Intentions (1 = Conservatives, 2 = Labour,
3 = Liberal Democrat, 4 = Others), Y2 represents Sex (1 = male, 2 = female)
and Y3 represents Social Class (1 = high, 2 = middle, 3 = low).

Remark. In the paper [12], 1 = low and 3 = high were wrongly given instead of
1 = high and 3 = low.

We have to classify voters according to their Voting Intentions on the 
basis of Sex and Social Class. 

First we create from Table 6 a decision table shown in Table 7:

Table 7. Voting Intentions

 U   Y2   Y3   Y1   Support   Strength

 1    1    1    1       28      0.03
 2    1    1    2        8      0.01
 3    1    1    3        7      0.01
 4    1    2    1      153      0.18
 5    1    2    2      114      0.13
 6    1    2    3       53      0.06
 7    1    2    4       14      0.02
 8    1    3    1       20      0.02
 9    1    3    2       31      0.04
10    1    3    3       17      0.02
11    1    3    4        1      0.00
12    2    1    1        1      0.00
13    2    1    2        1      0.00
14    2    1    4        1      0.00
15    2    2    1      165      0.19
16    2    2    2       86      0.10
17    2    2    3       54      0.06
18    2    2    4        6      0.01
19    2    3    1       30      0.03
20    2    3    2       57      0.07
21    2    3    3       18      0.02
22    2    3    4        4      0.00
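
The Strength column of Table 7 can be recomputed from the counts of Table 6. The following Python sketch assumes that strength is the support of a row divided by the total number of voters in the table (869 here), rounded to two decimals; the function and variable names are illustrative and do not come from the source.

```python
# Counts from Table 6, keyed by (sex Y2, social class Y3); each list holds
# the counts for voting intention Y1 = 1, 2, 3, 4.
TABLE_6 = {
    (1, 1): [28, 8, 7, 0],
    (1, 2): [153, 114, 53, 14],
    (1, 3): [20, 31, 17, 1],
    (2, 1): [1, 1, 0, 1],
    (2, 2): [165, 86, 54, 6],
    (2, 3): [30, 57, 18, 4],
}


def decision_table(counts):
    """Return rows (Y2, Y3, Y1, support, strength), skipping empty cells."""
    total = sum(sum(row) for row in counts.values())
    rows = []
    for (sex, social_class), row in sorted(counts.items()):
        for intention, support in enumerate(row, start=1):
            if support > 0:
                # Assumed definition: strength = support / total.
                strength = round(support / total, 2)
                rows.append((sex, social_class, intention, support, strength))
    return rows


rows = decision_table(TABLE_6)
for number, row in enumerate(rows, start=1):
    print(number, *row)
```

Cells with zero support are skipped, which is why the table has 22 rows rather than 24.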



Next we simplify the decision table by employing only the decision rules
with maximal strength, and we get the decision table presented in Table 8.

Table 8. Simplified Decision Table

 U   Y2   Y3   Y1   Support   Strength

 1    1    1    1       28      0.07
 2    1    2    1      153      0.35
 3    1    3    2       31      0.07
 4    2    2    1      165      0.38
 5    2    3    2       57      0.13
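
The simplification step can be sketched in Python as follows. Because strength is proportional to support, the rule with maximal strength for a given pair (Y2, Y3) is the one with maximal support. The handling of ties is an assumption: the source does not say how conditions without a unique strongest rule are treated, but Table 8 has no row for Y2 = 2, Y3 = 1, where three decisions share support 1, so the sketch drops such conditions. All names are illustrative.

```python
# Counts from Table 6, keyed by (sex Y2, social class Y3); each list holds
# the counts for voting intention Y1 = 1, 2, 3, 4.
TABLE_6 = {
    (1, 1): [28, 8, 7, 0],
    (1, 2): [153, 114, 53, 14],
    (1, 3): [20, 31, 17, 1],
    (2, 1): [1, 1, 0, 1],
    (2, 2): [165, 86, 54, 6],
    (2, 3): [30, 57, 18, 4],
}


def strongest_rules(counts):
    """Keep, for each (Y2, Y3), the single decision with maximal support.

    Conditions whose maximum is shared by several decisions are dropped
    (an assumption made to match the rows of Table 8).
    """
    rules = []
    for (sex, social_class), row in sorted(counts.items()):
        best = max(row)
        if row.count(best) == 1:
            intention = row.index(best) + 1  # Y1 values start at 1
            rules.append((sex, social_class, intention, best))
    return rules


rules = strongest_rules(TABLE_6)
for number, rule in enumerate(rules, start=1):
    print(number, *rule)
```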



It can be easily seen that the set of condition attributes can be reduced
(see [4]) and that the only reduct is the attribute Y3 (Social Class).

Thus Table 8 can be replaced by Table 9.



Table 9. Reduced Decision Table

U   Y3   Y1   Strength     Certainty    Coverage

1    1    1   0.07(0.03)   1.00(0.60)   0.10(0.07)
2    2    1   0.73(0.37)   1.00(0.49)   0.90(0.82)
3    3    2   0.20(0.11)   1.00(0.55)   1.00(0.31)



The numbers in parentheses refer to Table 7.

From this decision table we get the following decision algorithm:

                                              cer.

1. high class   → Conservative party         0.60
2. middle class → Conservative party         0.49
3. lower class  → Labour party               0.55

The efficiency of the decision algorithm is 0.51.

The inverse decision algorithm is given below:

                                              cer.

1'. Conservative party → high class          0.07
2'. Conservative party → middle class        0.82
3'. Labour party       → lower class         0.31

The efficiency of the inverse decision algorithm is 0.48. 
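
The certainty and coverage factors quoted for the two algorithms can be checked with a short Python script. It is a sketch under one assumption that the source does not state explicitly: each factor is computed from the two-decimal strengths of Table 7 (summed over both sexes) rather than from the raw counts. Under that assumption the script returns the bracketed values of Table 9 and the efficiency 0.51; computed from raw counts, some figures differ slightly (for example 29/46, about 0.63, for the high class). All names are illustrative.

```python
from collections import defaultdict

# Counts from Table 6, keyed by (sex Y2, social class Y3); each list holds
# the counts for voting intention Y1 = 1, 2, 3, 4.
TABLE_6 = {
    (1, 1): [28, 8, 7, 0],
    (1, 2): [153, 114, 53, 14],
    (1, 3): [20, 31, 17, 1],
    (2, 1): [1, 1, 0, 1],
    (2, 2): [165, 86, 54, 6],
    (2, 3): [30, 57, 18, 4],
}
TOTAL = sum(sum(row) for row in TABLE_6.values())

# Strength of each (class, party) pair in hundredths: every Table 7 row is
# rounded to two decimals first, then the two sexes are added.
strength = defaultdict(int)
for (_sex, social_class), row in TABLE_6.items():
    for party, support in enumerate(row, start=1):
        strength[(social_class, party)] += round(100 * support / TOTAL)


def class_total(social_class):
    return sum(v for (c, _party), v in strength.items() if c == social_class)


def party_total(party):
    return sum(v for (_class, p), v in strength.items() if p == party)


def certainty(social_class, party):
    """Share of the class that intends to vote for the party."""
    return round(strength[(social_class, party)] / class_total(social_class), 2)


def coverage(social_class, party):
    """Share of the party's voters that belongs to the class."""
    return round(strength[(social_class, party)] / party_total(party), 2)


RULES = [(1, 1), (2, 1), (3, 2)]  # (class, party) pairs of rules 1-3
efficiency = sum(strength[rule] for rule in RULES) / 100

for social_class, party in RULES:
    print(social_class, party,
          certainty(social_class, party), coverage(social_class, party))
print("efficiency", efficiency)
```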

From the decision algorithm and the inverse decision algorithm we can
conclude the following:

- 60% of the high class and 49% of the middle class intend to vote for the
  Conservative party

- 55% of the lower class intend to vote for the Labour party

- 7% of those who intend to vote for the Conservative party belong to the
  high class

- 82% of those who intend to vote for the Conservative party belong to the
  middle class

- 31% of those who intend to vote for the Labour party belong to the
  lower class

We advise the reader to examine the approach and results presented in [12]
and compare them with those shown here.

Clearly, the rough set approach is much simpler and gives better results
than that discussed in [12].

8 Conclusions 

The notion of a decision algorithm has been defined and its connection with 
decision table and other basic concepts of rough set theory discussed. Some 
probabilistic properties of decision algorithms have been revealed, in par- 
ticular the relationship with the Total Probability Theorem and the Bayes’ 
Theorem. These relationships give a new efficient method to draw conclusions 
from data, without referring to prior and posterior probabilities intrinsically 
associated with Bayesian reasoning.

The past two decades have witnessed a dramatic growth in the use of
probability-based methods in a wide variety of applications centering on
automation of decision-making in an environment of uncertainty and
incompleteness of information.

Successes of probability theory have high visibility. But what is not widely 
recognized is that successes of probability theory mask a fundamental limitation 
- the inability to operate on what may be called perception-based information. 
Such information is exemplified by the following. Assume that I look at a box 
containing balls of various sizes and form the perceptions: (a) there are about 
twenty balls; (b) most are large; and (c) a few are small. The question is: What 
is the probability that a ball drawn at random is neither large nor small?
Probability theory cannot answer this question because there is no mechanism within
the theory to represent the meaning of perceptions in a form that lends itself to 
computation. The same problem arises in the examples: 

Usually Robert returns from work at about 6 pm. What is the probability
that Robert is home at 6:30 pm?

I do not know Michelle’s age but my perceptions are: (a) it is very unlikely
that Michelle is old; and (b) it is likely that Michelle is not young. What is
the probability that Michelle is neither young nor old?

X is a normally distributed random variable with small mean and small
variance. What is the probability that X is large?

Given the data in an insurance company database, what is the probability that
my car may be stolen? In this case, the answer depends on perception-based
information which is not in the insurance company database.

In these simple examples - examples drawn from everyday experiences - 
the general problem is that of estimation of probabilities of imprecisely defined 
events, given a mixture of measurement-based and perception-based informa- 
tion. The crux of the difficulty is that perception-based information is usually 
described in a natural language - a language which probability theory cannot 
understand and hence is not equipped to handle. 

To endow probability theory with a capability to operate on perception-based
information, it is necessary to generalize it in three ways. To this end,
let PT denote standard probability theory of the kind taught in university-level
courses. The three modes of generalization are labeled: (a) f-generalization; (b)
f.g-generalization; and (c) nl-generalization. More specifically:

(a) f-generalization involves fuzzification, that is, progression from crisp sets
to fuzzy sets, leading to a generalization of PT which is denoted as PT+. In PT+,
probabilities, functions, relations, measures and everything else are allowed to
have fuzzy denotations, that is, be a matter of degree. In particular,
probabilities described as low, high, not very high, etc. are interpreted as labels
of fuzzy subsets of the unit interval or, equivalently, as possibility distributions
of their numerical values.

(b) f.g-generalization involves fuzzy granulation of variables, functions,
relations, etc., leading to a generalization of PT which is denoted as PT++. By
fuzzy granulation of a variable, X, what is meant is a partition of the range of
X into fuzzy granules, with a granule being a clump of values of X which are
drawn together by indistinguishability, similarity, proximity, or functionality.
For example, fuzzy granulation of the variable Age partitions its values into fuzzy
granules labeled very young, young, middle-aged, old, very old, etc. Membership
functions of such granules are usually assumed to be triangular or trapezoidal.
Basically, granulation reflects the bounded ability of the human mind to resolve
detail and store information.

(c) nl-generalization involves an addition to PT++ of a capability to represent
the meaning of propositions expressed in a natural language, with the
understanding that such propositions serve as descriptors of perceptions.
Nl-generalization of PT leads to perception-based probability theory, denoted
as PTp.
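
Fuzzy granulation of the variable Age with trapezoidal membership functions can be illustrated by a small Python sketch. The granule labels come from the text; the breakpoints (in years) are hypothetical values chosen only for illustration, and the function names are not part of the source.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 1 on [b, c], 0 outside (a, d), linear between."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical breakpoints (a, b, c, d) for each granule of Age.
AGE_GRANULES = {
    "very young": (0, 0, 10, 20),
    "young": (10, 20, 30, 40),
    "middle-aged": (30, 40, 50, 60),
    "old": (50, 60, 70, 80),
    "very old": (70, 80, 120, 120),
}


def granulate(age):
    """Degree of membership of a numerical age in every granule."""
    return {label: trapezoid(age, *points)
            for label, points in AGE_GRANULES.items()}


print(granulate(35))
```

With these breakpoints an age of 35 belongs to the granules young and middle-aged to degree 0.5 each, which shows how neighbouring granules overlap instead of forming a crisp partition.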

An assumption which plays a key role in PTp is that the meaning of a 
proposition, p, drawn from a natural language may be represented as what is 
called a generalized constraint on a variable. More specifically, a generalized
constraint is represented as X isr R, where X is the constrained variable; R 
is the constraining relation; and isr, pronounced ezar, is a copula in which r 
is an indexing variable whose value defines the way in which R constrains X. 
The principal types of constraints are: equality constraint, in which case isr is 
abbreviated to =; possibilistic constraint, with r abbreviated to blank; veristic 
constraint, with r=v; probabilistic constraint, in which case r=p, X is a random 
variable and R is its probability distribution; random-set constraint, r=rs, in 
which case X is a set-valued random variable and R is its probability distribution;
fuzzy-graph constraint, r=fg, in which case X is a function or a relation and R 
is its fuzzy graph; and usuality constraint, r=u, in which case X is a random 
variable and R is its usual - rather than expected - value. 
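
The form X isr R and the values of r listed above can be captured in a small data structure. The Python sketch below is only an illustration: the class name, the field names and the textual rendering are assumptions, and the constraining relation R is kept as an uninterpreted string.

```python
from dataclasses import dataclass

# Values of the indexing variable r named in the text.
CONSTRAINT_TYPES = {
    "=": "equality",
    "": "possibilistic",
    "v": "veristic",
    "p": "probabilistic",
    "rs": "random-set",
    "fg": "fuzzy-graph",
    "u": "usuality",
}


@dataclass(frozen=True)
class GeneralizedConstraint:
    variable: str  # the constrained variable X
    r: str         # the indexing variable of the copula isr
    relation: str  # the constraining relation R, kept symbolic here

    def __post_init__(self):
        if self.r not in CONSTRAINT_TYPES:
            raise ValueError(f"unknown constraint type r={self.r!r}")

    @property
    def kind(self):
        return CONSTRAINT_TYPES[self.r]

    def __str__(self):
        copula = "=" if self.r == "=" else f"is{self.r}"
        return f"{self.variable} {copula} {self.relation}"


c = GeneralizedConstraint("X", "p", "N(0, 1)")
print(c, "-", c.kind)
print(GeneralizedConstraint("X", "", "small"))
```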

The principal constraints are allowed to be modified, qualified, and combined, 
leading to composite generalized constraints. An example is: usually (X is small) 
and (X is large) is unlikely. Another example is: if (X is very small) then (Y is 
not very large) or if (X is large) then (Y is small). 

The collection of composite generalized constraints forms what is referred 
to as the Generalized Constraint Language (GCL). Thus, in PTp, the General- 
ized Constraint Language serves to represent the meaning of perception-based 
information. Translation of descriptors of perceptions into GCL is accomplished 
through the use of what is called the constraint-centered semantics of natural 
languages (CSNL). Translating descriptors of perceptions into GCL is the first 
stage of perception-based probabilistic reasoning. 

The second stage involves goal-directed propagation of generalized constraints
from premises to conclusions. The rules governing generalized constraint
propagation coincide with the rules of inference in fuzzy logic. The principal rule of
inference is the generalized extension principle. In general, use of this
principle reduces computation of desired probabilities to the solution of constrained
problems in variational calculus or mathematical programming.

It should be noted that constraint-centered semantics of natural languages 
serves to translate propositions expressed in a natural language into GCL. What 
may be called the constraint-centered semantics of GCL, written as CSGCL, 
serves to represent the meaning of a composite constraint in GCL as a singu- 
lar constraint X isr R. The reduction of a composite constraint to a singular 
constraint is accomplished through the use of rules which govern generalized 
constraint propagation. 

Another point of importance is that the Generalized Constraint Language is 
maximally expressive, since it incorporates all conceivable constraints. A propo- 
sition in a natural language, NL, which is translatable into GCL is said to be 
admissible. The richness of GCL justifies the default assumption that any given 
proposition in NL is admissible. The subset of admissible propositions in NL 
constitutes what is referred to as a precisiated natural language, PNL. The con- 
cept of PNL opens the door to a significant enlargement of the role of natural 
languages in information processing, decision and control. 

Perception-based theory of probabilistic reasoning suggests new problems 
and new directions in the development of probability theory. It is inevitable that 
in coming years there will be a progression from PT to PTp, since PTp enhances 
the ability of probability theory to deal with realistic problems in which
decision-relevant information is a mixture of measurements and perceptions.

Abstract. An approach to the multi-facet task of situation identification
by an Unmanned Aerial Vehicle (UAV) is presented. The concept of a
multi-layered identification system based on a soft computing approach to
reasoning with incomplete, imprecise or vague information is discussed.



1 Introduction 

The task of controlling an Unmanned Aerial Vehicle (UAV) in traffic control
applications raises a plethora of different problems (refer to [11]). Among
them is the task of identifying the current road situation and deciding whether
it should be considered normal or potentially dangerous. The main source
of information is the video system mounted on board the UAV. It provides
us with images of the situation underneath. Those images are gathered by
digital video cameras working in the visual and infrared bands. With the use of
advanced techniques from the areas of Image Processing and Computer Vision it is
possible to identify and describe symbolically the objects in the image, such as
cars, road borders, cross-roads etc. Once the basic features are extracted from
the image, the essential process of identifying the situation starts. The
measurements taken from the processed image are matched against the set of
decision rules. The rules that apply contribute to the final decision. In some
cases it is necessary to go back to the image since more information has to be
drawn from it. The set of rules that match the image being processed at the time
can identify a situation instantly or give us a choice of interpretations of what we
see. In the latter case we may apply a higher-level reasoning scheme in order
to finally reach the decision. The entire cognition scheme that we construct is
based on the concept of learning the basic notions and the reasoning method from
a set of pre-classified data (image sequences) given to us prior to
putting the system into operation.

Different tasks in the process of object identification require different learning
techniques. In many applications, it is necessary to create complex features
from simple ones. This observation calls for a hierarchical identification system
with a multi-layered structure. In the case of autonomous systems, some parameters
of this layered structure are determined from experimental data in learning
processes called layered learning (see [9]).

Soft computing is one of many modern methods for resolving problems
related to complex objects (e.g. classification, identification and description) in
real-life systems. We emphasize two major directions: computing with words
and granular computing, aiming to build foundations for approximate layered
reasoning (see [12], [13]).

In this paper we present an approach for layered learning based on soft com- 
puting techniques. We illustrate this idea by the problem of road situation identi- 
fication. We also describe a method for automatic reasoning and decision making 
under uncertainty of information about objects (situation, measurement, etc.) 
using our layered structure. 



2 The Problem of Complex Object Identification 

To see what problems lie behind the task of complex object
classification/identification, let us consider a simple example: an image
sequence showing two cars on a road turn, one overtaking the other at high
speed (see Figure 1). The question is how to mimic our perception of this
situation with an automatic system. What attributes in the image sequence should
be taken into account and what are their ranges?

We may utilise background knowledge gained from human experts as
well as general principles such as traffic regulations. However, the experts usually
formulate their opinions in vague terms such as: "IF there is a turn nearby AND
the car runs at high speed THEN this situation is rather dangerous".

Finding out which attributes are taken into account
within our background knowledge is the first step. After that we have to identify
the ranges for linguistic variables such as "turn nearby" and "high speed". The
only reasonable way to do that is learning by example. The choice of a proper
learning scheme for basic attribute semantics is another problem that has to
be overcome.
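
The expert rule quoted above can be evaluated numerically once ranges for its linguistic variables are fixed. The Python sketch below is an illustration only: the breakpoints are hypothetical (in the approach described here they would be learned from examples), the conjunction AND is interpreted as the minimum, which is a common choice in fuzzy logic but is not prescribed by the text, and all names are made up for the example.

```python
def ramp_up(x, low, high):
    """Membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def turn_nearby(distance_m):
    # Hypothetical range: fully "nearby" within 50 m, not at all beyond 200 m.
    return 1.0 - ramp_up(distance_m, 50.0, 200.0)


def high_speed(speed_kmh):
    # Hypothetical range: not "high" below 60 km/h, fully "high" above 100 km/h.
    return ramp_up(speed_kmh, 60.0, 100.0)


def danger_degree(distance_m, speed_kmh):
    """IF turn nearby AND high speed THEN dangerous, with AND taken as min."""
    return min(turn_nearby(distance_m), high_speed(speed_kmh))


print(danger_degree(50, 100))   # turn very close, car very fast
print(danger_degree(125, 80))   # both premises hold only partially
print(danger_degree(300, 120))  # no turn nearby
```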

Yet another complication is the traditional trade-off between exactness and
effectiveness. In real-time systems like the UAV we sometimes need to omit
some attributes since it takes too much time and effort to measure them. The
proper choice of attributes for a particular situation is crucial, and we cannot
expect to find a universal one. We should therefore possess a mechanism for
dynamic update of the feature vector as the situation below evolves.

From the very beginning we allow our model to use notions formulated
vaguely, imprecisely or fuzzily. That triggers the need for an inference
mechanism that is capable of using such constructs as well as maintaining
and propagating the levels of uncertainty.

The mechanism of propagating uncertainty should work both top-down and
bottom-up. It is important to be able to reach a conclusion with a good level
of certainty provided we have measurements that are certain enough. But it is
equally important to be able to solve this equation the other way round: to let
the requirements for the final answer determine the allowable uncertainty level
in the lower layers of the reasoning scheme. Once again, learning by example
seems to be the best way to do that. Of course, the spatio-temporal character of
the data must be utilised.




Fig. 1. The example of image sequence. 



3 Construction of Identification System 

Construction of Identification Structure includes: 

— Learning of basic concepts (on different layers) from sensor measurements 
and expert knowledge (see [7], [8]). 

— Synthesis of interfaces between different learning layers (in particular, for 
uncertainty coefficients propagation, decomposition of specifications and un- 
certainty coefficients as in [4]). 

The domain knowledge can be represented by levels of concepts. In the case of
the UAV, we propose a three-layer knowledge structure.

The first layer is built from the information gained using image processing
techniques. We can get information about color blobs in the image, contours,
edges and distances between the objects identified. With advanced
computer vision techniques and additional information about the placement and
movements of the UAV (see [5], [2]) it is possible to get readings that are highly
resistant to scaling, rotation and unwanted effects caused by movement of both
the target and the UAV. Information at this stage is exact, but may be incomplete.
Some of this information, although expressed in exact units, may be imprecise by
definition, e.g. car speed estimated from the image analysis.

The next layer incorporates terms that are expressed vaguely or inexactly. At
this level we use linguistic variables describing granules of information like
"Object A moves fast". We also describe interactions between the objects identified.
Those interactions, such as "Object A is too close to Object B", are vaguely
expressed as well.

The third layer is devoted to inference of the final response of the system. In the
basic concept, the inference is based on the set of decision rules extracted from the
knowledge we have with regard to the set of training examples. In Figure 2 we
present an example of a knowledge structure for a "dangerous overtaking maneuver".

The interlayer interfaces are crucial for the operation of the entire system.
They are responsible for unifying the "language" between layers.

The original features may be measured with different techniques and
units. Therefore it is necessary to unify them. At first glance this task may
look simple. What can be difficult in converting pixels to centimeters or degrees?
But we have to realise that in most cases subsequent images are affected by
various distortions caused by e.g. the changing angle of view as the UAV hovers
over the scene. Compensating for and augmenting various elements in order to pass a
unified set of measurements to the next level often requires the interface
to perform tasks such as pattern recognition, intelligent (adjustable) filtering and
so on.



[Figure 2: layered knowledge structure. Example rules shown in the inference layer:
IF Car1 is close to Car2 AND Car1 does not change direction THEN DANGER;
IF Car1 is changing lane AND Car1 does not change speed AND
road is straight and wide THEN OK]

Fig. 2. The layered structure.

The interface between the granule and inference layers is responsible for adjusting
the output of the process of matching measurements against existing granule
concepts. In order to be able to apply the decision rules, we have to express the
setting of granules in a particular situation using the terms used by the rules. This
process includes, among others, elements of spatio-temporal reasoning, since
it is the interface that sends the information about changes in the mutual placement
of granules in time and about changes in the placement of objects within granules.
This is especially crucial if we intend to make our system adaptive. The ability
to avoid another error and to disseminate cases that were improperly treated in
the past strongly relies on interface capabilities.

The learning tasks are defined in the same way as in Machine Learning. For
example, the identification problem for the UAV can be formulated as follows: given
a collection of situations on roads, TRAINING_SET = $\{(S_1, d_1), \ldots, (S_n, d_n)\}$,
labeled by decision values (verified by experts).

The main problem of layered learning is how to design a learning schema
consisting of basic learning algorithms that makes it possible to identify new
situations on the road properly.

The natural approach is to decompose the general, complex
task into simpler ones. For the knowledge structure presented in Figure 2 we
can propose the following learning schema (Figure 3).

Unfortunately, this simple schema does not work well, because of the
occurrence of phenomena characteristic of layered learning, such as:

1. constrained learning: It is necessary to determine the training data for
a particular learning algorithm in the design step. Usually, training data is
presented in the form of a decision table. For example, the decision table for
"Granule Learner 1" (see Figure 3) consists of:

- conditional attributes: "car speed", "atmosphere condition", .... In general,
these attributes are defined by properties of objects from the Measurement
Layer.

- decision attribute: in the simplest case, this attribute has two values:
YES for positive examples and NO for negative examples.

- examples (objects, cases), which are taken from situations of TRAINING_SET
restricted to the conditional attributes.

The problem comes from the fact that for every situation we have only the
global information whether this situation is dangerous or not and which rule
can be used to identify this fact, but we do not have information about the
particular basic concepts that contribute to that decision. For these situations,
we only have some constraints between basic concepts. For example, for some
situations we can have a constraint for "Granule Learner 1" and "Granule
Learner 2" of the form: "for this situation, one of those two learners must give
a negative answer".

2. tolerance relation learning: The learning algorithms (Learners, see Fig. 3)
used are not perfect. The possibility of an imprecise solution and of learning error
is their inherent feature. This raises the question of possible error accumulation
in consecutive layers. The problem is to set constraints on the error rates of both
individual algorithms and ensembles of algorithms in order to achieve acceptable
quality of the overall layered learning process. We can apply the general approach to
this problem called Rough Mereology (proposed in [4]). This idea is based on
determining some standard objects (or standards for short) for every concept
and a corresponding tolerance relation in such a way that if a new situation
is close enough to some basic concepts, this situation will be close enough to
the higher-level concept as well.

[Figure 3: learning schema for the layered structure. Rules shown:
IF Car1 is close to Car2 AND Car1 does not change direction AND
Car1 has high speed THEN DANGER;
IF Car1 is changing lane AND Car1 does not change speed AND
road is straight and wide THEN OK]

Fig. 3. Learning in layered structure.
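
The constrained-learning difficulty described in item 1 can be made concrete with a short Python sketch. Everything in it is hypothetical: the attribute names, the two sample situations and the helper functions are invented for illustration. It shows that the decision table of a single granule learner has no local labels, only the global decision, and that a constraint of the kind quoted in the text merely narrows down the admissible pairs of answers.

```python
from itertools import product

# Hypothetical conditional attributes of "Granule Learner 1".
CONDITIONAL_ATTRIBUTES = ("car_speed", "atmosphere_condition")

# Hypothetical training situations (S_i, d_i) with a global decision only.
TRAINING_SET = [
    ({"car_speed": 95, "atmosphere_condition": "rain", "road_width": 6.5}, "DANGER"),
    ({"car_speed": 50, "atmosphere_condition": "clear", "road_width": 7.0}, "OK"),
]


def restrict(situation, attributes=CONDITIONAL_ATTRIBUTES):
    """Project a full situation description onto the learner's attributes."""
    return {name: situation[name] for name in attributes}


def decision_table(training_set, attributes=CONDITIONAL_ATTRIBUTES):
    """Rows (conditions, local label, global decision); the local label is
    unknown (None) because experts only verified the global decision."""
    return [(restrict(s, attributes), None, d) for s, d in training_set]


def violates_constraint(answer_1, answer_2):
    """Constraint from the text: one of the two learners must answer NO."""
    return answer_1 == "YES" and answer_2 == "YES"


def consistent_answers():
    """All answer pairs of the two learners that respect the constraint."""
    return [pair for pair in product(("YES", "NO"), repeat=2)
            if not violates_constraint(*pair)]


table = decision_table(TRAINING_SET)
print(table)
print(consistent_answers())
```

Three of the four answer pairs remain admissible, so the constraint alone does not determine the training labels of either learner.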

4 Construction of Reasoning Engine 

Reasoning mechanisms should take into account uncertainties resulting from un- 
certainties of sensor measurements, missing sensor values, bounds on resources 
like computation time, robustness with respect to parameter deviation and ne- 
cessity of adaptation to the changing environment. 

In the simplest version the reasoning engine revolves around the set of decision
rules extracted from background knowledge and observation of learning
examples. The output of matching feature measurements against predetermined
information granules is fed to that layer through the interlayer interface. The rules
that match to a satisfactory degree contribute to the final decision. There are two
sub-steps in this process. The first is the determination of the degree of matching
for a particular rule. This is done with the use of techniques known from rough set
theory and fuzzy set theory. The second is the determination of the final decision
on the basis of matching and non-matching rules. This is almost a classical example
of conflict resolution. Many techniques for this task have been developed
within rough set theory itself, as well as in related fields of soft computing.
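
The two sub-steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of the paper: the degree of matching of a rule is taken as the minimum of the degrees of its premises, conflicts are resolved by summing the degrees of satisfactorily matching rules per decision, and the threshold, the premise names and the sample degrees are all hypothetical.

```python
# Rules in the spirit of Figure 2; the premise names are illustrative.
RULES = [
    {"premises": ["car1 close to car2", "car1 keeps direction"],
     "decision": "DANGER"},
    {"premises": ["car1 changing lane", "car1 keeps speed",
                  "road straight and wide"],
     "decision": "OK"},
]


def matching_degree(rule, memberships):
    """Sub-step 1: degree to which all premises hold (minimum over premises)."""
    return min(memberships.get(premise, 0.0) for premise in rule["premises"])


def decide(rules, memberships, threshold=0.5):
    """Sub-step 2: resolve the conflict by summing the degrees of the rules
    that match at least to `threshold`; return None if no rule matches."""
    votes = {}
    for rule in rules:
        degree = matching_degree(rule, memberships)
        if degree >= threshold:
            votes[rule["decision"]] = votes.get(rule["decision"], 0.0) + degree
    if not votes:
        return None
    return max(votes, key=votes.get)


# Hypothetical output of the granule layer for one image sequence.
observed = {
    "car1 close to car2": 0.9,
    "car1 keeps direction": 0.8,
    "car1 changing lane": 0.2,
    "car1 keeps speed": 0.7,
    "road straight and wide": 1.0,
}
print(decide(RULES, observed))
```

A result of None corresponds to the case discussed next, in which no rule matches well enough and more information has to be requested from the lower layers.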

Since the entire system operates in a very unstable environment, it is very
likely that a decision is not reached instantly and the inference mechanism
requires more information. The request for more information is addressed to the
underlying components of the system. To make the decision process faster we should
demand information that not only helps to find matching rules, but also allows
elimination of the largest possible set of rules that can be identified as certainly
not relevant in a given situation. This kind of reasoning scheme resembles a
physician's approach. The physician first tries to eliminate possibilities and
then, based on what remains possible, orders additional examinations that allow
a final diagnosis to be made. The very same mechanism in the case of situation
identification makes it possible to decrease the size of the inference system and,
in consequence, improve effectiveness.
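
One way to choose the information to request, so that as many rules as possible are eliminated, is sketched below in Python. The selection criterion (maximise the number of rules eliminated in the least favourable case) is an assumption made for this illustration; the text does not fix a criterion. The rules, attribute names and value domains are hypothetical.

```python
# Hypothetical crisp rules: premises map attribute names to required values.
RULES = [
    {"if": {"close": True, "direction_change": False}, "then": "DANGER"},
    {"if": {"lane_change": True, "speed_change": False, "road": "straight"},
     "then": "OK"},
    {"if": {"close": False, "road": "straight"}, "then": "OK"},
]

# Attributes that could be measured on request, with their possible values.
DOMAINS = {
    "close": [True, False],
    "road": ["straight", "curved"],
    "lane_change": [True, False],
}


def eliminated(rules, attribute, value):
    """Rules that require a different value of `attribute` are ruled out."""
    return [r for r in rules
            if attribute in r["if"] and r["if"][attribute] != value]


def best_query(rules, domains):
    """Pick the attribute whose least helpful answer still removes most rules."""
    def worst_case(attribute):
        return min(len(eliminated(rules, attribute, value))
                   for value in domains[attribute])
    return max(domains, key=worst_case)


print(best_query(RULES, DOMAINS))
```

Here measuring "close" removes one rule whatever the answer turns out to be, whereas each of the other two attributes has an answer that removes none.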

Construction of Reasoning Engine includes: 

— Negotiation and dialog methods used for (relevant) concept perception on 
different layers. 

— Spatio-temporal reasoning schemes based on concept perception on different 
layers with application of soft computing methods (using e.g., rough or/and 
fuzzy approaches) to reason about the perception results on different layers. 

Negotiation and dialog methods are a core component of interlayer interfaces,
especially when they operate top-down, i.e. process the requests sent from the
inference layer to those below. If no rule in the inference layer matches well enough
in the current situation, the necessity of drawing additional information arises.
The catch is to obtain the result with the minimum possible effort. We can identify
within the inference layer the set of information granules that, if better described,
will be enough to take the decision. But, on the other hand, getting this particular
additional information may be complicated from the granule layer's point of view.
Implicitly, drawing additional information also requires negotiation between
the granule and measurement layers.

Within both negotiation processes (inference - granules and granules -
measurements) we have to establish a utility function. Such a utility function, if
maximised, makes it possible to find the most reasonable set of granules or
measurements for further processing. Unfortunately, we may not expect to find an
explicit form of this function. Rather, we should try to learn its estimate from a
set of examples using adaptive approximation methods such as neural networks.

5 Conclusions and Future Research

We presented a proposal for an autonomous system that can learn how to identify
complex objects from examples. Our idea was illustrated by the example of a
dangerous situation identification system which can be integrated into the UAV
project ([?]). We listed some new characteristic aspects of layered learning
methods. In future papers we intend to describe a system that learns to make
complex decisions. By a complex decision we mean e.g. a family of action plans:
such a system will not only tell us what is going on below (danger/no danger),
but also recommend what further action should be performed by the UAV. This
problem is extremely interesting for autonomous systems.

Abstract. We outline a rough-neuro computing model as a basis for
granular computing. Our approach is based on rough sets, rough mereology
and information granule calculus.

Keywords: rough sets, neural networks, granular computing, rough 
mereology 



1 Introduction 

Rough Mereology [4], [7] is a paradigm allowing for a synthesis of the main ideas of
two paradigms for reasoning under uncertainty: Fuzzy Set Theory and Rough
Set Theory. We present applications of Rough Mereology to the important
theoretical idea put forth by Lotfi Zadeh [11], [12], i.e., Granularity of Knowledge,
by presenting the idea of the rough-neuro computing paradigm.

We emphasize an important property of granular computing related to the ne- 
cessity of lossless compression tuning for complex object constructions. It means 
that we map a cluster of constructions into one representation. Any construction 
in the cluster is delivering objects satisfying the specification in satisfactory de- 
gree if only objects input to synthesis are sufficiently close to selected standards 
(prototypes). In rough mereological approach clusters of constructions are rep- 
resented by the so-called stable schemes (of co-operating agents), i.e., schemes 
robust to some deviations of parameters of transformed granules. In consequence, 
the stable schemes are able to return objects satisfying in satisfactory degree the 
specification not only from standard (prototype) objects but also from objects 
sufficiently close to them [4], [5]. In this way any stable scheme of complex object 
construction is a representation of a cluster of similar constructions from clusters 
of elementary objects. 

We extend the schemes for synthesis of complex objects (or granules) developed
in [7] and [5] by adding one important component. As a result we obtain
granule construction schemes which can be treated as a generalization of neural
network models. The main idea is that granules sent by one agent to another
are not, in general, exactly understandable by the receiving agent, because the
agents use different languages and usually there is no meaning-preserving
translation of formulas of the language of the sending agent into formulas of the
receiving agent. Hence, it is necessary to construct interfaces which allow the
receiving agent to understand received granules approximately. These interfaces can,
in the simplest case, be constructed on the basis of exchanged information about
agents stored in the form of decision data tables. From these tables the
approximations of concepts can be constructed using the rough set approach [10]. In our
model we assume that for any agent $ag$ and its operation $o(ag)$ of arity $n$ there
are approximation spaces $AS_1(o(ag), in), \ldots, AS_n(o(ag), in)$ which filter
(approximately) the granules received by the agent for performing the operation
$o(ag)$. In turn, the granule sent by the agent after performing the operation is
filtered (approximated) by the approximation space $AS(o(ag), out)$. These
approximation spaces are parameterized; the parameters allow one to optimize the
size of neighborhoods in these spaces as well as the inclusion relation [8], using
the quality of granule approximation as the optimization criterion. Approximation
spaces attached to an operation correspond to neuron weights in neural networks,
whereas the operation performed by the agent corresponds to the operation
realized on the vector of real numbers by the neuron. The generalized scheme of
agents returns a granule in response to input information granules. It can
be, for example, a cluster of elementary granules. Hence our schemes realize much
more general computations than neural networks operating on vectors of real
numbers. Whether such schemes can be efficiently simulated by classical
neural networks is an open question.
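
The three-stage behaviour of a single agent described above (filter the inputs, perform the operation, filter the output) can be illustrated by a minimal sketch. It is not the construction of the paper: we assume, purely for illustration, that granules are finite sets of elementary objects, that each approximation space is a partition of the universe into neighbourhoods, that filtering means taking the lower approximation, and that the operation is set union. All names (`lower_approx`, `rough_neuron`, `union_all`, `as_1`, `as_2`, `as_out`) are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence

Granule = FrozenSet[int]           # assumption: a granule is a set of elementary objects
Space = Sequence[FrozenSet[int]]   # assumption: an approximation space is a partition


def lower_approx(granule: Granule, space: Space) -> Granule:
    """Union of the neighbourhoods that lie entirely inside the granule."""
    inside = [block for block in space if block <= granule]
    return frozenset().union(*inside)


def rough_neuron(
    inputs: Sequence[Granule],
    input_spaces: Sequence[Space],
    operation: Callable[[List[Granule]], Granule],
    output_space: Space,
) -> Granule:
    # Stage 1: filter every received granule through its own input space AS_i(o(ag), in).
    filtered = [lower_approx(g, s) for g, s in zip(inputs, input_spaces)]
    # Stage 2: perform the agent's operation o(ag) on the filtered granules.
    raw = operation(filtered)
    # Stage 3: filter the result through the output space AS(o(ag), out).
    return lower_approx(raw, output_space)


def union_all(granules: List[Granule]) -> Granule:
    """Toy operation o(ag): the union of all input granules."""
    return frozenset().union(*granules)


as_1 = [frozenset(b) for b in ({0, 1}, {2, 3}, {4, 5}, {6, 7})]
as_2 = [frozenset(b) for b in ({0, 1, 2, 3}, {4, 5, 6, 7})]
as_out = [frozenset(b) for b in ({0, 1}, {2, 3}, {4, 5, 6, 7})]

result = rough_neuron(
    [frozenset({0, 1, 2, 4, 5}), frozenset({0, 1, 2, 3, 4})],
    [as_1, as_2],
    union_all,
    as_out,
)
print(sorted(result))  # [0, 1, 2, 3]
```

In this toy setting the coarseness of the partitions plays the role of the tunable parameters: changing the blocks of `as_1`, `as_2` or `as_out` changes what the agent perceives and returns.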

We would like to call extended schemes for complex object construction 
rough-neuro schemes (for complex object construction). The stability of such 
schemes corresponds to the resistance to noise of classical neural networks. 

In this paper we present in some detail the rough-neuro computing paradigm
outlined above.



2 Adaptive Calculus of Granules in Distributed Systems 



We now present a conceptual scheme for adaptive calculus of granules aimed 
at synthesizing solutions to problems posed under uncertainty. This exposition 
is based on our earlier analyses presented in [4], [7]. We construct a scheme of
agents which communicate by relating their respective granules of knowledge by 
means of transfer functions induced by rough mereological connectives extracted 
from their respective information systems. We assume the notation of [7] where 
the reader will find all the necessary information. 

We now define formally the ingredients of our scheme of agents.

2.1 Distributed Systems of Agents

We assume that a pair $(Inv, Ag)$ is given where $Inv$ is an inventory of elementary
objects and $Ag$ is a set of intelligent computing units called, for short, agents.

We consider an agent $ag \in Ag$. The agent $ag$ is endowed with tools for
reasoning about objects in its scope; these tools are defined by components of
the agent label. The label of the agent $ag$ is the tuple

$$lab(ag) = (\mathcal{A}(ag), M(ag), L(ag), Link(ag), AP\_O(ag), St(ag), Unc\_rel(ag), H(ag), Unc\_rule(ag), Dec\_rule(ag))$$

where 

1. $\mathcal{A}(ag) = (U(ag), A(ag))$ is an information system of the agent $ag$; we
assume as an example that objects (i.e., elements of $U(ag)$) are granules of
the form $(\alpha, [\alpha])$ where $\alpha$ is a conjunction of descriptors (one may have more
complex granules as objects).

2. $M(ag) = (U(ag), [0, 1], \mu_o(ag))$ is a pre-model of $L_{rm}$ with a pre-rough
inclusion $\mu_o(ag)$ on the universe $U(ag)$;

3. $L(ag)$ is a set of unary predicates (properties of objects) in a predicate
calculus interpreted in the set $U(ag)$; we may assume that formulae of $L(ag)$ are
constructed as conditional formulae of logics $L_B$ where $B \subseteq A(ag)$.

4. $St(ag) = \{st(ag)_1, \ldots, st(ag)_n\} \subseteq U(ag)$ is the set of standard objects at
$ag$;

5. $Link(ag)$ is a collection of strings of the form $t = ag_1 ag_2 \ldots ag_k ag$; the
intended meaning of a string $ag_1 ag_2 \ldots ag_k ag$ is that $ag_1, ag_2, \ldots, ag_k$ are children of
$ag$ in the sense that $ag$ can assemble complex objects (constructs) from simpler
objects sent by $ag_1, ag_2, \ldots, ag_k$. In general, we may assume that for some
agents $ag$ we may have more than one element in $Link(ag)$, which represents the
possibility of re-negotiating the synthesis scheme.

We denote by the symbol $Link$ the union of the family $\{Link(ag) : ag \in Ag\}$.

6. $AP\_O(ag)$ consists of pairs of the form:

$$(o(ag,t), ((AS_1(o(ag,t), in), \ldots, AS_n(o(ag,t), in)), AS(o(ag,t), out)))$$

where $o(ag,t) \in O(ag)$, $n$ is the arity of $o(ag,t)$, $t = ag_1 ag_2 \ldots ag_k ag \in Link$,
$AS_i(o(ag,t), in)$ is a parameterized approximation space [10] corresponding to
the $i$-th argument of $o(ag,t)$ and $AS(o(ag,t), out)$ is a parameterized
approximation space [10] for the output of $o(ag,t)$.

$O(ag)$ is the set of operations at $ag$; any $o(ag,t) \in O(ag)$ is a mapping of
the Cartesian product $U(ag) \times U(ag) \times \ldots \times U(ag)$ into the universe $U(ag)$;
the meaning of $o(ag,t)$ is that of an operation by means of which the agent
$ag$ is able to assemble from objects $x_1 \in U(ag_1), x_2 \in U(ag_2), \ldots, x_k \in U(ag_k)$
the object $z \in U(ag)$ which is an approximation defined by $AS(o(ag,t), out)$
to $o(ag,t)(y_1, y_2, \ldots, y_k) \in U(ag)$ where $y_i$ is the approximation to $x_i$ defined by
$AS_i(o(ag,t), in)$. One may choose here either a lower or an upper approximation.

7. $Unc\_rel(ag)$ is the set of uncertainty relations $unc\_rel_i$ of type

$$(o_i(ag,t), \rho_i(ag), ag_1, \ldots, ag_k, ag, \mu_o(ag_1), \ldots, \mu_o(ag_k), \mu_o(ag), st(ag_1)_i, \ldots, st(ag_k)_i, st(ag)_i)$$

where $ag_1 ag_2 \ldots ag_k ag \in Link(ag)$, $o_i(ag,t) \in O(ag)$ and $\rho_i$ is such that

$$\rho_i((x_1, \varepsilon_1), (x_2, \varepsilon_2), \ldots, (x_k, \varepsilon_k), (x, \varepsilon))$$

holds for $x_1 \in U(ag_1), \ldots, x_k \in U(ag_k)$ and $\varepsilon, \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_k \in [0, 1]$ iff
$\mu_o(x_j, st(ag_j)_i) = \varepsilon_j$ for $j = 1, 2, \ldots, k$ and $\mu_o(x, st(ag)_i) = \varepsilon$ for the collection of
standards $st(ag_1)_i, st(ag_2)_i, \ldots, st(ag_k)_i, st(ag)_i$ such that

$$o_i(ag,t)(st(ag_1)_i, st(ag_2)_i, \ldots, st(ag_k)_i) = st(ag)_i.$$

The operation $o$ performed by $ag$ here is more complex than that of [7], as it is
composed of three stages: first, approximations to input objects are constructed,
next the operation is performed, and finally the approximation to the result is
constructed. Relations $unc\_rel_i$ provide a global description of this process; in
reality, they are compositions of analogous relations corresponding to the three
stages. As a result, the relations $unc\_rel_i$ depend on the parameters of the approximation spaces.
This also concerns the other constructs discussed here. It follows that in order to
get satisfactory decomposition rules (and, similarly, uncertainty rules and so on) one has
to search for satisfactory parameters of approximation spaces (this is analogous
to weight tuning in neural computations).

Uncertainty relations express the agent's knowledge about relationships
among uncertainty coefficients of the agent $ag$ and uncertainty coefficients of its
children. The relational character of these dependencies expresses their
intensionality.

8. $Unc\_rule(ag)$ is the set of uncertainty rules $unc\_rule_j$ of type

$$(o_j(ag,t), f_j, ag_1, ag_2, \ldots, ag_k, ag, st(ag_1), st(ag_2), \ldots, st(ag_k), st(ag), \mu_o(ag_1), \ldots, \mu_o(ag_k), \mu_o(ag))$$

of the agent $ag$ where $ag_1 ag_2 \ldots ag_k ag \in Link(ag)$ and $f_j : [0, 1]^k \rightarrow [0, 1]$ is a
function which has the property that

if $o_j(ag,t)(st(ag_1), st(ag_2), \ldots, st(ag_k)) = st(ag)$ and
$x_1 \in U(ag_1), x_2 \in U(ag_2), \ldots, x_k \in U(ag_k)$
satisfy the conditions $\mu_o(x_i, st(ag_i)) \geq \varepsilon(ag_i)$ for $i = 1, 2, \ldots, k$,

then $\mu_o(o_j(ag,t)(x_1, x_2, \ldots, x_k), st(ag)) \geq f_j(\varepsilon(ag_1), \varepsilon(ag_2), \ldots, \varepsilon(ag_k))$.



Uncertainty rules provide functional operators (approximate mereological 
connectives) for propagating uncertainty measure values from the children of 
an agent to the agent; their application is in negotiation processes where they 
inform agents about plausible uncertainty bounds. 

9. $H(ag)$ is a strategy which produces uncertainty rules from uncertainty
relations; to this end, various rigorous formulas as well as various heuristics can
be applied, among them the algorithm presented in Section 2.8 of [7].

10. $Dec\_rule(ag)$ is a set of decomposition rules $dec\_rule_i$ of type

$$(o_i(ag,t), ag_1, ag_2, \ldots, ag_k, ag)$$

such that $(\Phi(ag_1), \Phi(ag_2), \ldots, \Phi(ag_k), \Phi(ag)) \in dec\_rule_i$ (where $\Phi(ag_1) \in L(ag_1)$,
$\Phi(ag_2) \in L(ag_2), \ldots, \Phi(ag_k) \in L(ag_k)$, $\Phi(ag) \in L(ag)$ and $ag_1 ag_2 \ldots ag_k ag \in Link(ag)$)
and there exists a collection of standards $st(ag_1), st(ag_2), \ldots, st(ag_k), st(ag)$
with the properties that

$$o_i(ag,t)(st(ag_1), st(ag_2), \ldots, st(ag_k)) = st(ag),$$

$st(ag_i)$ satisfies $\Phi(ag_i)$ for $i = 1, 2, \ldots, k$ and $st(ag)$ satisfies $\Phi(ag)$.

Decomposition rules are decomposition schemes in the sense that they describe
the standard $st(ag)$ and the standards $st(ag_1), \ldots, st(ag_k)$ from which the
standard $st(ag)$ is assembled under $o_i$ in terms of predicates which these
standards satisfy.

We may sum up the content of items 1-10 above by saying that for any agent
$ag$ the possible sets of children of this agent are specified and, relative to each
team of children, we are given the decompositions of standard objects at $ag$ into sets of standard
objects at the children, as well as the uncertainty relations and uncertainty rules which
relate similarity degrees of objects at the children to their respective standards
and the similarity degree of the object built by $ag$ to the corresponding standard
object at $ag$.

We take rough inclusions of agents as measures of uncertainty in their re- 
spective universes. We would like to observe that the mereological relation of 
being a part is not transitive globally over the whole synthesis scheme because 
distinct agents use distinct mereological languages. 
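
The role of uncertainty rules (item 8) in propagating similarity bounds from children to their parent can be sketched as follows. This is only an illustration under stated assumptions: the tree, the class `Agent`, the function `guaranteed_bound` and the concrete rules (`min` and the product) are hypothetical choices of functions $f_j : [0,1]^k \rightarrow [0,1]$; the paper does not prescribe any particular $f_j$.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class Agent:
    name: str
    children: List["Agent"] = field(default_factory=list)
    # Hypothetical uncertainty rule f_j of this agent (unused for leaves).
    rule: Callable[[Sequence[float]], float] = min


def guaranteed_bound(agent: Agent, leaf_eps: Dict[str, float]) -> float:
    """Lower bound on the similarity of the object built at `agent` to its standard.

    Leaves contribute their own coefficient eps(ag); every inner agent applies
    its rule f_j to the bounds already obtained for its children.
    """
    if not agent.children:
        return leaf_eps[agent.name]
    child_bounds = [guaranteed_bound(child, leaf_eps) for child in agent.children]
    return agent.rule(child_bounds)


# Toy scheme: root assembles from a and b; b assembles from c and d.
b = Agent("b", [Agent("c"), Agent("d")], rule=math.prod)
root = Agent("root", [Agent("a"), b], rule=min)

bound = guaranteed_bound(root, {"a": 0.9, "c": 0.8, "d": 0.5})
print(round(bound, 3))  # 0.4
```

Here agent b guarantees 0.8 * 0.5 = 0.4 and the root guarantees min(0.9, 0.4) = 0.4. In negotiations such bounds tell each agent which coefficients it may accept from its children.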



2.2 Approximate Synthesis of Complex Objects 

The process of synthesis of a complex object (signal, action) by the scheme of
agents defined above consists, in our approach, of two communication stages, viz.
the top-down communication/negotiation process and the bottom-up
communication/assembling process. We outline the two stages here in the language
of approximate formulae.



Approximate logic of synthesis. For simplicity of exposition and to avoid
unnecessarily tedious notation, we assume that the relation $ag' \leq ag$, which
holds for agents $ag', ag \in Ag$ iff there exists a string $ag_1 ag_2 \ldots ag_k ag \in Link(ag)$
with $ag' = ag_i$ for some $i \leq k$, orders the set $Ag$ into a tree. We also assume
that $O(ag) = \{o(ag,t)\}$ for $ag \in Ag$, i.e., each agent has a unique assembling
operation for a unique $t$.

To express the two communication stages in the language of approximate formulae,
we build a logic $L(Ag)$ (cf. [7]) in which we
can express global properties of the synthesis process. We recall our assumption
that the set $Ag$ is ordered into a tree by the relation $ag' \leq ag$.

Elementary formulae of $L(Ag)$ are of the form $\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ where
$st(ag) \in St(ag)$, $\Phi(ag) \in L(ag)$, $\varepsilon(ag) \in [0, 1]$ for any $ag \in Ag$. Formulae of
$L(Ag)$ form the smallest extension of the set of elementary formulae closed under
propositional connectives $\vee$, $\wedge$, $\neg$ and under the modal operators $\Box$, $\Diamond$.

To introduce a semantics for the logic $L(Ag)$, we first specify the meaning of
satisfaction for elementary formulae. The meaning of a formula $\Phi(ag)$ is defined
classically as the set $[\Phi(ag)] = \{u \in U(ag) : u \text{ has the property } \Phi(ag)\}$; we
will denote the fact that $u \in [\Phi(ag)]$ by the symbol $u \models \Phi(ag)$. We now extend
the satisfiability predicate $\models$ to approximate formulae: for $x \in U(ag)$, we
say that $x$ satisfies an elementary formula $\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$, in symbols:
$x \models \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$, iff (i) $st(ag) \models \Phi(ag)$ and (ii) $\mu_o(ag)(x, st(ag)) \geq \varepsilon(ag)$.

We let

(iii) $x \models \neg \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ iff it is not true that $x \models \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$;

(iv) $x \models \langle st(ag)_1, \Phi(ag)_1, \varepsilon(ag)_1 \rangle \vee \langle st(ag)_2, \Phi(ag)_2, \varepsilon(ag)_2 \rangle$ iff
$x \models \langle st(ag)_1, \Phi(ag)_1, \varepsilon(ag)_1 \rangle$ or $x \models \langle st(ag)_2, \Phi(ag)_2, \varepsilon(ag)_2 \rangle$.

In order to extend the semantics over modalities, we first introduce the notion
of a selection: by a selection over $Ag$ we mean a function $sel$ which assigns to
each agent $ag$ an object $sel(ag) \in U(ag)$.

For two selections $sel, sel'$ we say that $sel$ induces $sel'$, in symbols $sel \rightarrow_{Ag} sel'$,
when $sel(ag) = sel'(ag)$ for any $ag \in Leaf(Ag)$ and

$$sel'(ag) = o(ag,t)(sel'(ag_1), sel'(ag_2), \ldots, sel'(ag_k))$$

for any $ag_1 ag_2 \ldots ag_k ag \in Link$.

We extend the satisfiability predicate $\models$ to selections: for an elementary
formula $\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$, we let $sel \models \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ iff
$sel(ag) \models \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$.

We now let $sel \models \Diamond \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ when there exists a selection
$sel'$ satisfying the conditions:

(i) $sel \rightarrow_{Ag} sel'$; (ii) $sel' \models \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$.
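
The semantics of induced selections and of the modality can be made concrete with a small sketch. Since each agent has a unique operation $o(ag,t)$, the induced selection $sel'$ is determined by the values of $sel$ at the leaves, so checking the modal formula reduces to computing $sel'$. The sketch assumes, for illustration only, that objects are finite sets, that the rough inclusion is $\mu(x, y) = |x \cap y| / |x|$, and that the single inner agent assembles by set union; the names `induced_selection`, `satisfies`, `satisfies_diamond` and `mu` are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List

Obj = FrozenSet[int]


def induced_selection(sel: Dict[str, Obj], links: Dict[str, List[str]],
                      ops: Dict[str, Callable[..., Obj]], head: str) -> Dict[str, Obj]:
    """Return sel' with sel ->_Ag sel': equal to sel at leaves, assembled elsewhere."""
    sel_prime: Dict[str, Obj] = {}

    def build(ag: str) -> Obj:
        if ag in links:  # inner agent: apply o(ag, t) to the children's objects
            args = [build(child) for child in links[ag]]
            sel_prime[ag] = ops[ag](*args)
        else:            # leaf agent: keep the selected object
            sel_prime[ag] = sel[ag]
        return sel_prime[ag]

    build(head)
    return sel_prime


def satisfies(x: Obj, st: Obj, phi: Callable[[Obj], bool], eps: float,
              mu: Callable[[Obj, Obj], float]) -> bool:
    """x |= <st, phi, eps> iff (i) st |= phi and (ii) mu(x, st) >= eps."""
    return phi(st) and mu(x, st) >= eps


def satisfies_diamond(sel, ag, st, phi, eps, mu, links, ops, head) -> bool:
    """sel |= <> <st, phi, eps>, evaluated at agent `ag` of the induced selection."""
    sel_prime = induced_selection(sel, links, ops, head)
    return satisfies(sel_prime[ag], st, phi, eps, mu)


def mu(x: Obj, y: Obj) -> float:
    """Assumed rough inclusion: the fraction of x contained in y."""
    return len(x & y) / len(x) if x else 1.0


links = {"head": ["a", "b"]}
ops = {"head": lambda x, y: x | y}
sel = {"head": frozenset(), "a": frozenset({1, 2}), "b": frozenset({2, 3})}
st = frozenset({1, 2, 4, 5})
phi = lambda s: 1 in s  # toy property of the standard

ok = satisfies_diamond(sel, "head", st, phi, 0.6, mu, links, ops, "head")
too_strict = satisfies_diamond(sel, "head", st, phi, 0.7, mu, links, ops, "head")
print(ok, too_strict)  # True False
```

The assembled object is {1, 2, 3}, of which two thirds lie in the standard, so the formula holds for the threshold 0.6 and fails for 0.7.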

In terms of the logic $L(Ag)$ it is possible to express the problem of synthesis of
an approximate solution to the problem posed to the team $Ag$. We denote by
$head(Ag)$ the root of the tree $(Ag, \leq)$.

In the process of top-down communication, a requirement received by the
scheme from an external source (which may be called a customer) is decomposed
into approximate specifications of the form $\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ for any agent
$ag$ of the scheme. The decomposition process is initiated at the agent $head(Ag)$
and propagated down the tree.

We are able now to formulate the synthesis problem. 

Synthesis problem

Given a formula $\alpha : \langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$ find a
selection $sel$ over the tree $(Ag, \leq)$ with the property $sel \models \Diamond \alpha$.

A solution to the synthesis problem with a given formula

$$\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$$

is found by negotiations among the agents; negotiations are based on uncertainty
rules of agents and their successful result can be expressed by a top-down
recursion in the tree $(Ag, \leq)$ as follows: given a local team $ag_1 ag_2 \ldots ag_k ag$ with
the formula $\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ already chosen in negotiations on a higher tree
level, it is sufficient that each agent $ag_i$ choose a standard $st(ag_i) \in U(ag_i)$, a
formula $\Phi(ag_i) \in L(ag_i)$ and a coefficient $\varepsilon(ag_i) \in [0, 1]$ such that

(v) $(\Phi(ag_1), \Phi(ag_2), \ldots, \Phi(ag_k), \Phi(ag)) \in Dec\_rule(ag)$ with standards $st(ag)$,
$st(ag_1), \ldots, st(ag_k)$;

(vi) $f(\varepsilon(ag_1), \ldots, \varepsilon(ag_k)) \geq \varepsilon(ag)$ where $f$ satisfies $unc\_rule(ag)$ with $st(ag)$,
$st(ag_1), \ldots, st(ag_k)$ and $\varepsilon(ag_1), \ldots, \varepsilon(ag_k)$, $\varepsilon(ag)$.

For a formula $\alpha : \langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$, we call an
$\alpha$-scheme an assignment of a formula $\alpha(ag) : \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ to each
$ag \in Ag$ in such a manner that (v), (vi) above are satisfied and $\alpha(head(Ag))$
is $\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$; we denote this scheme with the
symbol

$$sch(\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle).$$

We say that a selection $sel$ is compatible with a scheme

$$sch(\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle)$$

when $\mu_o(ag)(sel(ag), st(ag)) \geq \varepsilon(ag)$ for each leaf agent $ag \in Ag$, where
$\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ is the value of the scheme at $ag$.

Any leaf agent realizes its approximate specification by choosing in the subset
$Inv \cap U(ag)$ of the inventory of primitive constructs a construct satisfying the
specification.

The goal of negotiations can be summarized now as follows. 

Proposition 3.1

For a given requirement $\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$ we have:
if a selection $sel$ is compatible with a scheme

$$sch(\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle)$$

then $sel \models \Diamond \langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$.
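
The two numerical conditions behind the proposition, condition (vi) for every local team and compatibility of a selection at the leaves, can be checked mechanically. The following is a minimal sketch under illustrative assumptions: objects are finite sets, the rough inclusion is $|x \cap y| / |x|$, and the uncertainty rule of the head agent is `min`. The helper names `condition_vi_holds`, `is_compatible` and `mu` are hypothetical, and condition (v) on decomposition rules is not modelled.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence

Obj = FrozenSet[int]


def mu(x: Obj, y: Obj) -> float:
    """Assumed rough inclusion: the fraction of x contained in y."""
    return len(x & y) / len(x) if x else 1.0


def condition_vi_holds(links: Dict[str, List[str]],
                       rules: Dict[str, Callable[[Sequence[float]], float]],
                       eps: Dict[str, float]) -> bool:
    """Check f(eps(ag_1), ..., eps(ag_k)) >= eps(ag) for every local team."""
    return all(rules[ag]([eps[child] for child in children]) >= eps[ag]
               for ag, children in links.items())


def is_compatible(sel: Dict[str, Obj], standards: Dict[str, Obj],
                  eps: Dict[str, float], leaves: Sequence[str]) -> bool:
    """Check mu(sel(ag), st(ag)) >= eps(ag) for every leaf agent."""
    return all(mu(sel[ag], standards[ag]) >= eps[ag] for ag in leaves)


links = {"head": ["a", "b"]}
rules = {"head": min}
eps = {"head": 0.5, "a": 0.6, "b": 0.65}
standards = {"a": frozenset({1, 2, 3}), "b": frozenset({4, 5})}

good = {"a": frozenset({1, 2}), "b": frozenset({4, 5, 6})}  # inclusions 1.0 and 2/3
bad = {"a": frozenset({1, 2}), "b": frozenset({4, 6, 7})}   # inclusions 1.0 and 1/3

print(condition_vi_holds(links, rules, eps))             # True
print(is_compatible(good, standards, eps, ["a", "b"]))   # True
print(is_compatible(bad, standards, eps, ["a", "b"]))    # False
```

When both checks succeed, the uncertainty rule guarantees that the object assembled at the head agent is similar to its standard in degree at least 0.5, which is the content of the proposition in this toy case.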



The bottom-up communication consists of agents sending the chosen constructs
to their parents. The root agent $root(Ag)$ assembles the final construct.

3 Conclusions 

We have outlined a general scheme for rough-neuro computation based on ideas
of knowledge granulation by rough mereological tools. An important practical
problem is the construction of such schemes (networks) for rough-neuro computing
and of algorithms for parameter tuning. We now foresee two possible approaches:
the first, in which we would rely on new, original decomposition, synthesis and
tuning methods in analogy to [7] but in the presence of approximation spaces;
the second, in which a rough-neuro computing scheme would be encoded by a
neural network in such a way that optimization of weights in the neural net
leads to satisfactory solutions for the rough-neuro computing scheme (cf. [3] for
a first attempt in this direction).



Abstract. The aim of the paper is to present basic notions related to
granular computing, namely the information granule syntax and semantics
as well as the inclusion and closeness (similarity) relations of granules.
In particular, we discuss how to define approximation of complex
granule sets using the above notions.



1 Introduction 

We would like to discuss briefly an example showing a motivation for our work 
[6]. Let us consider a team of agents recognizing the situation on the road. The 
aim is to classify a given situation as, e.g., dangerous or not. This soft speci- 
fication granule is represented by a family of information granules called case 
soft patterns representing cases, like cars are too close. The whole scene (ac- 
tual situation on the road) is decomposed into regions perceived by local agents. 
Higher level agents can reason about regions observed by teams of their children
agents. They can express in their own languages features used by their children. 
Moreover, they can use new features like attributes describing relations between 
regions perceived by children agents. The problem is how to organize agents 
into a team (having, e.g., tree structure) with the property that the informa- 
tion granules synthesized by the team from input granules (being perceptions of 
local agents from sensor measurements) will identify the situation on the road 
in the following sense: the granule constructed by the team from input granules 
representing the situation on the road is sufficiently close to the soft specifica- 
tion granule named dangerous if and only if the situation on the road is really 
dangerous. We expect that if the team is returning a granule sufficiently close 
to the soft specification granule dangerous then also a special case of the soft 
pattern dangerous is identified helping to explain the situation. 

The aim of our project is to develop foundations for this kind of reasoning.
In particular it is necessary to give precise meaning to notions like:
information granules, soft information granules, closeness of information granules to a
satisfactory degree, information granules synthesized by a team of agents, etc. The
present paper realizes the first step toward this goal.

The general scheme is depicted in Figure 1. 




To sum up, we consider a set of agents $Ag$. Each agent is equipped with
some approximation spaces (defined using the rough set approach [1]). Agents
cooperate to solve a problem specified by a special agent called the customer-agent.
The result of cooperation is a scheme of agents. In the simplest case the scheme
can be represented by a tree labeled by agents. In this tree leaves deliver
some information granules (representing the perception of a given situation by
leaf agents) and any non-leaf agent $ag \in Ag$ performs an operation $o(ag)$ on
approximations of granules delivered by its children. The root agent returns an
information granule being the result of computation by the scheme on granules
delivered by leaf agents. It is important to note that different agents use different
languages. Thus granules delivered by children agents to their father can
usually be perceived by him only in an approximate sense before he can perform any
operation on the delivered granules.

In particular, we point out in the paper the problem of approximation of
information granule sets and we show that the first step toward such a notion is
similar to the classical rough set approach.

2 Syntax and Semantics of Information Granules

In this section we consider several examples of information granule
constructions and present the syntax and semantics of information granules. In the
following section we discuss the inclusion and closeness relations for granules.

Elementary granules. In an information system $IS = (U, A)$, elementary
granules are defined by $EF_B(x)$, where $EF_B$ is a conjunction of selectors of the
form $a = a(x)$, $B \subseteq A$ and $x \in U$. For example, the meaning of an elementary
granule $a = 1 \wedge b = 1$ is defined by

$$\|a = 1 \wedge b = 1\|_{IS} = \{x \in U : a(x) = 1 \ \& \ b(x) = 1\}.$$

Sequences of granules. Let us assume that $S$ is a sequence of granules and
the semantics $\|\cdot\|_{IS}$ in $IS$ of its elements have been defined. We extend $\|\cdot\|_{IS}$
on $S$ by $\|S\|_{IS} = (\|g\|_{IS})_{g \in S}$.

Example 1. Granules defined by rules in information systems are examples of
sequences of granules. Let $IS$ be an information system and let $(\alpha, \beta)$ be a
new information granule received from the rule if $\alpha$ then $\beta$, where $\alpha, \beta$ are
elementary granules of $IS$. The semantics $\|(\alpha, \beta)\|_{IS}$ of $(\alpha, \beta)$ is the pair of sets
$(\|\alpha\|_{IS}, \|\beta\|_{IS})$.

Sets of granules. Let us assume that a set $G$ of granules and the semantics
$\|\cdot\|_{IS}$ in $IS$ for granules from $G$ have been defined. We extend $\|\cdot\|_{IS}$ on the
family of sets $H \subseteq G$ by $\|H\|_{IS} = \{\|g\|_{IS} : g \in H\}$.

Example 2. One can consider granules defined by sets of rules. Assume that there
is a set of rules $Rule\_Set = \{(\alpha_i, \beta_i) : i = 1, \ldots, k\}$. The semantics of $Rule\_Set$
is defined by

$$\|Rule\_Set\|_{IS} = \{\|(\alpha_i, \beta_i)\|_{IS} : i = 1, \ldots, k\}.$$
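
The semantics of elementary granules, of rule granules (Example 1) and of rule sets (Example 2) can be computed directly on a small information system. The sketch below is an illustration only: the table, the attribute names and the helpers `meaning`, `rule_meaning` and `rule_set_meaning` are hypothetical, and an elementary granule is represented as a dictionary of selectors.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Row = Dict[str, int]
Table = Dict[str, Row]        # object name -> attribute values
Elementary = Dict[str, int]   # conjunction of selectors, e.g. {"a": 1, "b": 1}
Rule = Tuple[Elementary, Elementary]


def meaning(alpha: Elementary, table: Table) -> FrozenSet[str]:
    """||alpha||_IS: the objects satisfying every selector of alpha."""
    return frozenset(x for x, row in table.items()
                     if all(row[a] == v for a, v in alpha.items()))


def rule_meaning(rule: Rule, table: Table) -> Tuple[FrozenSet[str], FrozenSet[str]]:
    """||(alpha, beta)||_IS: the pair (||alpha||_IS, ||beta||_IS)."""
    alpha, beta = rule
    return meaning(alpha, table), meaning(beta, table)


def rule_set_meaning(rules: Iterable[Rule], table: Table) -> FrozenSet:
    """||Rule_Set||_IS: the set of meanings of the individual rules."""
    return frozenset(rule_meaning(rule, table) for rule in rules)


table: Table = {
    "x1": {"a": 1, "b": 1, "d": 1},
    "x2": {"a": 1, "b": 0, "d": 0},
    "x3": {"a": 1, "b": 1, "d": 0},
    "x4": {"a": 0, "b": 1, "d": 1},
}

print(sorted(meaning({"a": 1, "b": 1}, table)))  # ['x1', 'x3']
premise, conclusion = rule_meaning(({"a": 1}, {"d": 1}), table)
print(sorted(premise), sorted(conclusion))       # ['x1', 'x2', 'x3'] ['x1', 'x4']
```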

Example 3. One can also consider as a set of granules the family of all granules
$(\alpha, Rule\_Set(DT_\alpha))$, where $\alpha$ belongs to a given subset of elementary granules.

Example 4. Granules defined by sets of decision rules corresponding to a given
evidence are also examples of sequences of granules. Let $DT = (U, A \cup \{d\})$ be
a decision table and let $\alpha$ be an elementary granule of $IS = (U, A)$ such that
$\|\alpha\|_{IS} \neq \emptyset$. Let $Rule\_Set(DT_\alpha)$ be the set of decision rules (e.g. in minimal
form) of the decision table $DT_\alpha = (\|\alpha\|_{IS}, A \cup \{d\})$ being the restriction of $DT$
to objects satisfying $\alpha$. We obtain a new granule $(\alpha, Rule\_Set(DT_\alpha))$ with the
semantics

$$\|(\alpha, Rule\_Set(DT_\alpha))\|_{DT} = (\|\alpha\|_{IS}, \|Rule\_Set(DT_\alpha)\|_{DT}).$$

This granule describes a decision algorithm applied in the situation characterized
by $\alpha$.

Extension of granules defined by tolerance relation. We present examples
of granules obtained by application of a tolerance relation.

Example 5. One can consider the extension of elementary granules defined by a
tolerance relation. Let $IS = (U, A)$ be an information system and let $\tau$ be a tolerance
relation on elementary granules of $IS$. Any pair $(\alpha, \tau)$ is called a $\tau$-elementary
granule. The semantics $\|(\alpha, \tau)\|_{IS}$ of $(\alpha, \tau)$ is the family $\{\|\beta\|_{IS} : (\beta, \alpha) \in \tau\}$.

Example 6. Let us consider granules defined by rules of tolerance information
systems. Let $IS = (U, A)$ be an information system and let $\tau$ be a tolerance
relation on elementary granules of $IS$. If if $\alpha$ then $\beta$ is a rule in $IS$ then the
semantics of a new information granule $(\tau : \alpha, \beta)$ is defined by
$\|(\tau : \alpha, \beta)\|_{IS} = \|(\alpha, \tau)\|_{IS} \times \|(\beta, \tau)\|_{IS}$.

Example 7. We consider granules defined by sets of decision rules corresponding
to a given evidence in tolerance decision tables. Let $DT = (U, A \cup \{d\})$ be a
decision table and let $\tau$ be a tolerance on elementary granules of $IS = (U, A)$.
Now, any granule $(\alpha, Rule\_Set(DT_\alpha))$ can be considered as a representative of the
information granule cluster $(\tau : (\alpha, Rule\_Set(DT_\alpha)))$ with the semantics

$$\|(\tau : (\alpha, Rule\_Set(DT_\alpha)))\|_{DT} = \{\|(\beta, Rule\_Set(DT_\beta))\|_{DT} : (\beta, \alpha) \in \tau\}.$$
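
The semantics of a $\tau$-elementary granule from Example 5 is a family of sets, one for each elementary granule tolerant to $\alpha$. A minimal sketch follows; the single-attribute table, the tolerance `tau` (values differing by at most 1) and the helper names are assumptions made only for this illustration.

```python
from typing import Callable, Dict, FrozenSet, Sequence, Tuple

Table = Dict[str, Dict[str, int]]
Elementary = Tuple[Tuple[str, int], ...]   # conjunction of selectors (attribute, value)


def meaning(alpha: Elementary, table: Table) -> FrozenSet[str]:
    """||alpha||_IS: the objects satisfying every selector of alpha."""
    return frozenset(x for x, row in table.items()
                     if all(row[a] == v for a, v in alpha))


def tolerance_meaning(alpha: Elementary,
                      tau: Callable[[Elementary, Elementary], bool],
                      granules: Sequence[Elementary],
                      table: Table) -> FrozenSet[FrozenSet[str]]:
    """||(alpha, tau)||_IS = { ||beta||_IS : (beta, alpha) in tau }."""
    return frozenset(meaning(beta, table) for beta in granules if tau(beta, alpha))


table: Table = {"x1": {"a": 1}, "x2": {"a": 2}, "x3": {"a": 3}, "x4": {"a": 5}}
granules = [(("a", v),) for v in (1, 2, 3, 5)]


def tau(beta: Elementary, alpha: Elementary) -> bool:
    """Assumed tolerance: the values of attribute a differ by at most 1."""
    return abs(beta[0][1] - alpha[0][1]) <= 1


family = tolerance_meaning((("a", 2),), tau, granules, table)
print(sorted(sorted(s) for s in family))  # [['x1'], ['x2'], ['x3']]
```

The relation `tau` is reflexive and symmetric but not transitive (1 is tolerant to 2 and 2 to 3, but 1 is not tolerant to 3), which is what distinguishes a tolerance from an equivalence.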

Labeled graph granules. We discuss graph granules and labeled graph granules
as notions extending the previously introduced granules defined by a tolerance
relation.

Example 8. Let us consider granules defined by pairs $(G, E)$, where $G$ is a finite
set of granules and $E \subseteq G \times G$. Let $IS = (U, A)$ be an information system.
The semantics of a new information granule $(G, E)$ is defined by
$\|(G, E)\|_{IS} = (\|G\|_{IS}, \|E\|_{IS})$, where $\|G\|_{IS} = \{\|g\|_{IS} : g \in G\}$ and
$\|E\|_{IS} = \{(\|g\|_{IS}, \|g'\|_{IS}) : (g, g') \in E\}$.

Example 9. Let $G$ be a set of granules. Labeled graph granules over $G$ are defined
by $(X, E, f, h)$, where $f : X \rightarrow G$ and $h : E \rightarrow P(G \times G)$. We also assume one
additional condition:

if $(x, y) \in E$ then $(f(x), f(y)) \in h(x, y)$.

The semantics of the labeled graph granule $(X, E, f, h)$ is defined by

$$\{(\|f(x)\|_{IS}, \|h(x, y)\|_{IS}, \|f(y)\|_{IS}) : (x, y) \in E\}.$$

Let us summarize the above considerations. One can define the set
of granules $G$ as the least set containing a given set of elementary granules $G_0$ and
closed with respect to the operations of new granule construction defined above.
We have the following examples of granule construction rules:

$$\frac{\alpha_1, \ldots, \alpha_k \text{ - elementary granules}}{\{\alpha_1, \ldots, \alpha_k\} \text{ - granule}}$$

$$\frac{\alpha_1, \alpha_2 \text{ - elementary granules}}{(\alpha_1, \alpha_2) \text{ - granule}}$$

$$\frac{\alpha \text{ - elementary granule}, \ \tau \text{ - tolerance relation on elementary granules}}{(\tau : \alpha) \text{ - granule}}$$

$$\frac{G \text{ - a finite set of granules}, \ E \subseteq G \times G}{(G, E) \text{ - granule}}$$

Let us observe that in the case of granules constructed by application of a tolerance
relation the rule is restricted to elementary granules. To obtain a more
general rule like

$$\frac{\alpha \text{ - graph granule}, \ \tau \text{ - tolerance relation on graph granules}}{(\tau : \alpha) \text{ - granule}}$$

it is necessary to extend the tolerance (similarity, closeness) relation to more
complex objects. We discuss the problem of closeness extension in the following
section.

3 Granule Inclusion and Closeness 

In this section we will discuss inclusion and closeness of different information 
granules introduced in the previous section. Let us mention that the choice of 
inclusion or closeness definition depends very much on the area of application and 
data analyzed. This is the reason that we have decided to introduce a separate 
section with this more subjective part of granule semantics. 

The inclusion relation between granules $G, G'$ of degree at least $p$ will be
denoted by $\nu_p(G, G')$. Similarly, the closeness relation between granules $G, G'$
of degree at least $p$ will be denoted by $cl_p(G, G')$. By $p$ we denote a vector of
parameters (e.g. positive real numbers).

A general scheme for construction of hierarchical granules and their closeness
can be described by the following recursive meta-rule: if granules of order $\leq k$
and their closeness have been defined then the closeness $cl_p(G, G')$ (at least in
degree $p$) between granules $G, G'$ of order $k + 1$ can be defined by applying an
appropriate operator $F$ to closeness values of components of $G, G'$, respectively.

A general scheme for defining more complex granules from simpler ones can be
explored using the rough mereological approach [2].

Inclusion and closeness of elementary granules. We have introduced the
simplest case of granules in an information system $IS = (U, A)$. They are defined
by $EF_B(x)$, where $EF_B$ is a conjunction of selectors of the form $a = a(x)$,
$B \subseteq A$ and $x \in U$. Let $G_{IS} = \{EF_B(x) : B \subseteq A \ \& \ x \in U\}$. In the standard
rough set model [1] elementary granules describe indiscernibility classes with
respect to some subsets of attributes. In a more general setting, see e.g. [3], [5],
tolerance (similarity) classes are described.

The crisp inclusion of $\alpha$ in $\beta$, where $\alpha, \beta \in \{EF_B(x) : B \subseteq A \ \& \ x \in U\}$, is defined by
$\|\alpha\|_{IS} \subseteq \|\beta\|_{IS}$, where $\|\alpha\|_{IS}$ and $\|\beta\|_{IS}$ are sets of objects from $IS$ satisfying
$\alpha$ and $\beta$, respectively. The non-crisp inclusion, known in KDD, for the case of
association rules is defined by means of two thresholds $t$ and $t'$:

$$support_{IS}(\alpha, \beta) = card(\|\alpha \wedge \beta\|_{IS}) \geq t, \text{ and}$$

$$accuracy_{IS}(\alpha, \beta) = \frac{support_{IS}(\alpha, \beta)}{card(\|\alpha\|_{IS})} \geq t'.$$

Elementary granule inclusion in a given information system $IS$ can be defined
using different schemes, e.g., by

$\nu^{IS}_{t,t'}(\alpha, \beta)$ if and only if $support_{IS}(\alpha, \beta) \geq t \ \& \ accuracy_{IS}(\alpha, \beta) \geq t'$.

The closeness of granules can be defined by

$cl^{IS}_{t,t'}(\alpha, \beta)$ if and only if $\nu^{IS}_{t,t'}(\alpha, \beta)$ and $\nu^{IS}_{t,t'}(\beta, \alpha)$ hold.
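
The threshold-based inclusion and closeness of elementary granules can be sketched as follows. We read accuracy as the usual ratio of the support to the cardinality of $\|\alpha\|_{IS}$ (the confidence of an association rule); the table and the function names `support`, `accuracy`, `included` and `close` are hypothetical.

```python
from typing import Dict, FrozenSet

Table = Dict[str, Dict[str, int]]
Elementary = Dict[str, int]


def meaning(alpha: Elementary, table: Table) -> FrozenSet[str]:
    """||alpha||_IS: the objects satisfying every selector of alpha."""
    return frozenset(x for x, row in table.items()
                     if all(row[a] == v for a, v in alpha.items()))


def support(alpha: Elementary, beta: Elementary, table: Table) -> int:
    """card(||alpha AND beta||_IS), computed as the size of the intersection."""
    return len(meaning(alpha, table) & meaning(beta, table))


def accuracy(alpha: Elementary, beta: Elementary, table: Table) -> float:
    """support(alpha, beta) / card(||alpha||_IS); taken as 0.0 for an empty ||alpha||_IS."""
    size = len(meaning(alpha, table))
    return support(alpha, beta, table) / size if size else 0.0


def included(alpha, beta, table, t: int, t_prime: float) -> bool:
    """nu_{t,t'}(alpha, beta): both thresholds are met."""
    return support(alpha, beta, table) >= t and accuracy(alpha, beta, table) >= t_prime


def close(alpha, beta, table, t: int, t_prime: float) -> bool:
    """cl_{t,t'}(alpha, beta): inclusion holds in both directions."""
    return included(alpha, beta, table, t, t_prime) and included(beta, alpha, table, t, t_prime)


table: Table = {
    "x1": {"a": 1, "b": 1},
    "x2": {"a": 1, "b": 0},
    "x3": {"a": 1, "b": 1},
    "x4": {"a": 0, "b": 1},
}
alpha, beta = {"a": 1}, {"b": 1}

print(support(alpha, beta, table))         # 2
print(close(alpha, beta, table, 2, 0.6))   # True  (accuracy is 2/3 both ways)
print(close(alpha, beta, table, 2, 0.7))   # False
```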

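Decision rules as granules. One can define inclusion and closeness of granules
corresponding to rules of the form if $\alpha$ then $\beta$ using accuracy coefficients.

Having such granules $g = (\alpha, \beta)$, $g' = (\alpha', \beta')$ one can define inclusion and
closeness of $g$ and $g'$ by

$\nu^{IS}_{t,t'}(g, g')$ if and only if $\nu^{IS}_{t,t'}(\alpha, \alpha')$ and $\nu^{IS}_{t,t'}(\beta, \beta')$.

The closeness can be defined by

$cl^{IS}_{t,t'}(g, g')$ if and only if $\nu^{IS}_{t,t'}(g, g')$ and $\nu^{IS}_{t,t'}(g', g)$.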

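Extensions of elementary granules by tolerance relation. For extensions
of elementary granules defined by a similarity (tolerance) relation, i.e., granules of
the form $(\alpha, \tau)$, $(\beta, \tau)$, one can consider the following inclusion measure:

$\nu^{IS}_{t,t'}((\alpha, \tau), (\beta, \tau))$ if and only if
$\nu^{IS}_{t,t'}(\alpha', \beta')$ for any $\alpha', \beta'$ such that $(\alpha, \alpha') \in \tau$ and $(\beta, \beta') \in \tau$

and the following closeness measure:

$cl^{IS}_{t,t'}((\alpha, \tau), (\beta, \tau))$ if and only if $\nu^{IS}_{t,t'}((\alpha, \tau), (\beta, \tau))$ and $\nu^{IS}_{t,t'}((\beta, \tau), (\alpha, \tau))$.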

Sets of rules. It can be important for some applications to define closeness
of an elementary granule $\alpha$ and the granule $(\alpha, \tau)$. The definition reflecting the
intuition that $\alpha$ should be a representation of $(\alpha, \tau)$ sufficiently close to this
granule is the following one:

$cl^{IS}_{t,t'}(\alpha, (\alpha, \tau))$ if and only if $cl^{IS}_{t,t'}(\alpha, \beta)$ for any $(\alpha, \beta) \in \tau$.

An important problem related to association rules is that the number of such
rules generated even from a simple data table can be large. Hence, one should
search for methods of aggregating close association rules. We suggest that this
can be defined as searching for some close information granules.

Let us consider two finite sets $Rule\_Set$ and $Rule\_Set'$ of association rules
defined by $Rule\_Set = \{(\alpha_i, \beta_i) : i = 1, \ldots, k\}$ and
$Rule\_Set' = \{(\alpha'_i, \beta'_i) : i = 1, \ldots, k'\}$. One can treat them as higher order
information granules. These new granules $Rule\_Set$, $Rule\_Set'$ can be treated as close
in a degree at least $t$ (in $IS$) if and only if there exists a relation $rel$ between
the sets of rules $Rule\_Set$ and $Rule\_Set'$ such that:

1. For any $Rule$ from the set $Rule\_Set$ there is $Rule'$ from $Rule\_Set'$ such that
$(Rule, Rule') \in rel$ and $Rule$ is close to $Rule'$ (in $IS$) in degree at least $t$.

2. For any $Rule'$ from the set $Rule\_Set'$ there is $Rule$ from $Rule\_Set$ such that
$(Rule, Rule') \in rel$ and $Rule$ is close to $Rule'$ (in $IS$) in degree at least $t$.
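
Conditions 1 and 2 can be tested without searching over all relations: if any suitable relation exists, then the relation consisting of all close pairs is also suitable, so it suffices to build that relation and check that it covers both sets. The sketch below takes the closeness of individual rules as a given predicate; the toy predicate `toy_close` and the rule names are hypothetical.

```python
from typing import Callable, Hashable, Sequence


def rule_sets_close(rule_set: Sequence[Hashable],
                    rule_set_prime: Sequence[Hashable],
                    rule_close: Callable[[Hashable, Hashable], bool]) -> bool:
    """Check conditions 1 and 2 for the largest candidate relation rel."""
    # rel collects every pair of rules that are close in the required degree.
    rel = {(r, r2) for r in rule_set for r2 in rule_set_prime if rule_close(r, r2)}
    # Condition 1: every rule of Rule_Set has a close partner in Rule_Set'.
    cond_1 = all(any((r, r2) in rel for r2 in rule_set_prime) for r in rule_set)
    # Condition 2: every rule of Rule_Set' has a close partner in Rule_Set.
    cond_2 = all(any((r, r2) in rel for r in rule_set) for r2 in rule_set_prime)
    return cond_1 and cond_2


# Toy closeness of individual rules, given explicitly as a set of close pairs.
close_pairs = {("r1", "s1"), ("r2", "s1")}


def toy_close(rule: Hashable, rule_prime: Hashable) -> bool:
    return (rule, rule_prime) in close_pairs


print(rule_sets_close(["r1", "r2"], ["s1"], toy_close))        # True
print(rule_sets_close(["r1", "r2"], ["s1", "s2"], toy_close))  # False: s2 has no partner
```

In the first call the single rule s1 aggregates both r1 and r2, which is the kind of compression of close association rules suggested above.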

Another way of defining closeness of two granules $G_1, G_2$ represented by sets
of rules can be described as follows.

Let us consider again two granules $Rule\_Set$ and $Rule\_Set'$ corresponding
to two decision algorithms. By $I(\beta'_i)$ we denote the set $\{j : cl_p(\beta'_i, \beta_j)\}$ for any
$i = 1, \ldots, k'$.

Now, we assume $\nu_p(Rule\_Set, Rule\_Set')$ if and only if for any $i \in \{1, \ldots, k'\}$
there exists a set $J \subseteq \{1, \ldots, k\}$ such that

$$cl_p\Big(\bigvee_{j \in J} \beta_j, \beta'_i\Big) \quad \text{and} \quad cl_p\Big(\bigvee_{j \in J} \alpha_j, \alpha'_i\Big),$$

and for closeness we assume

$cl_p(Rule\_Set, Rule\_Set')$ if and only if
$\nu_p(Rule\_Set, Rule\_Set')$ and $\nu_p(Rule\_Set', Rule\_Set)$.

One can consider the problem of searching for a granule $Rule\_Set'$ of minimal
size such that $Rule\_Set$ and $Rule\_Set'$ are close.

Granules defined by sets of granules. The previously discussed methods of
defining inclusion and closeness can be easily adapted to the case of granules
defined by sets of already defined granules. Let $G, H$ be sets of granules.

The inclusion of $G$ in $H$ can be defined by

$\nu^{IS}_{t,t'}(G, H)$ if and only if for any $g \in G$ there is $h \in H$ for which $\nu^{IS}_{t,t'}(g, h)$,

and the closeness by $cl^{IS}_{t,t'}(G, H)$ if and only if $\nu^{IS}_{t,t'}(G, H)$ and $\nu^{IS}_{t,t'}(H, G)$.

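We have the following examples of inclusion and closeness propagation rules:

$$\frac{\text{for any } \alpha \in G \text{ there is } \alpha' \in H \text{ such that } \nu_p(\alpha, \alpha')}{\nu_p(G, H)}$$

$$\frac{cl_p(\alpha, \alpha'), \ cl_p(\beta, \beta')}{cl_p((\alpha, \beta), (\alpha', \beta'))}$$

$$\frac{\text{for any } \alpha' \in \tau(\alpha) \text{ there is } \beta' \in \tau(\beta) \text{ such that } \nu_p(\alpha', \beta')}{\nu_p((\alpha, \tau), (\beta, \tau))}$$

$$\frac{cl_p(G, G') \text{ and } cl_p(E, E')}{cl_p((G, E), (G', E'))}$$

where $\alpha, \alpha', \beta, \beta'$ are elementary granules and $G, G'$ are finite sets of elementary
granules.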

One can also present other discussed cases for measuring the inclusion and 
closeness of granules in the form of inference rules. The exemplary rules have 
a general form, i.e., they are true in any information system (under the chosen 
definition of inclusion and closeness). 

4 Approximations of information granule sets

We now introduce the approximation operations for granule sets assuming a
given granule system $\mathcal{G}$ specified by the syntax and semantics of information granules
from the universe $U$ as well as by the relations of inclusion $\nu_p$ and closeness $cl_q$
in degrees at least $p$, $q$, respectively.

For a given granule $g$ we define its neighborhood $I_p(g)$ to be the set of all
information granules from $U$ close to $g$ in degree at least $p$.

For any subset $X \subseteq U$ we define its lower and upper approximation by

$$LOW(\mathcal{G}, p, q, X) = \{g \in U : \nu_q(I_p(g), X)\},$$

$$UPP(\mathcal{G}, p, t, X) = \{g \in U : \nu_t(I_p(g), X)\},$$

respectively, where $\nu_q(I_p(g), X)$ iff for any granule $r \in I_p(g)$ the condition $\nu_q(r, x)$ holds for
some $x \in X$, and $0.5 < t < q$.

Hence it follows that the approximations of sets can be defined analogously 
to the classical rough set approach. In our next paper we will discuss how to 
define approximation of complex information granules taking into account their 
structure (e.g., defined by the relation to be a part in a degree [2]).



Conclusions 

We have presented the concept of approximation of complex information granule 
sets. This notion seems to be crucial for further investigations on approximate 
reasoning based on information granules. 
Abstract. The concept of a (crisp) set has been extended to fuzzy sets and
rough sets. The key notion of a rough set is its two boundaries, the lower
and upper approximations, where the lower approximation must lie inside
the upper approximation. This inclusion condition makes inference
using rough sets complex: the two approximations cannot be determined
independently. In this paper, probabilistic inference on rough sets is
discussed under two interpretations of If-Then rules: conditional
probability and logical implication. Some interesting correlations arise
between the lower and upper approximations after probabilistic
inference.



1 Introduction 

In propositional logic, the truth values of propositions are either 1 (true) or
0 (false). Inference based on propositional (binary) logic is done using the inference
rule Modus Ponens, shown in Fig. 1(a). This rule states that if an If-Then rule
“$A \to B$” and the proposition A are both given as true (1) as premises, then we
conclude that proposition B is true (1).

The inference rule based on propositional logic is extended to probabilistic
inference based on probability theory in order to treat uncertain knowledge.
The truth values of propositions are given as the probabilities of events, which
take any value in the range [0, 1]. Here, U is the sample space (universal set),
$A, B \subseteq U$ are events, and the probability that “an event A happens”, $P(A)$, is
defined as $P(A) = |A|/|U|$ ($|U| = 1$, $|A| = a \in [0, 1]$) under the interpretation
of randomness. Thus the probabilistic inference rule can be written as in Fig. 1(b),
adapting the style of Modus Ponens.

    (a)  $A \to B$          (b)  $P(A \to B) = i$
         $A$                     $P(A) = a$
         ---------               ----------------------------------
         $B$                     $P(B) = b$,  $i, a, b \in [0, 1]$

Fig. 1. (a) Modus Ponens (b) Probabilistic Inference

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 73-81, 2001. 
If the probabilities of $A \to B$ and A are both 1 ($i = a = 1$), then b is 1, since
probabilistic inference should include modus ponens as a special
case. Our focus is to determine the probability of B from the probabilities of
$A \to B$ and A, each of which may take any value in [0, 1]. $A \to B$ is interpreted as “if A
is true, then B is true” in the meta-language. Traditional Bayes’ theorem, applied
in many probabilistic systems, adopts conditional probability as the interpretation
of the If-Then rule. However, the precise interpretation of the symbol $\to$ is not
unique and is still under discussion among many researchers.

Nilsson [1] presented a semantical generalization of ordinary first-order logic
in which the truth values of sentences can range between 0 and 1. He established
the foundation of probabilistic logic through a possible-world analysis and
probabilistic entailment. However, in most cases, we are not given the probabilities
for the different sets of possible worlds, but must induce them from what we are
given.

Pawlak [2] discussed the relationship between Bayes’ inference rule and
decision rules from the rough set perspective, and revealed that two conditional
probabilities, called certainty and coverage factors, satisfy Bayes’ rule.
Related work has been done by Yao [6]-[9] on interval-set and interval-valued
probabilistic reasoning.

Our goal is to deduce a conclusion and its associated probability from given
rules and facts and their associated probabilities through simple geometric
analysis. The probability of the sentence “if A then B” is interpreted in two ways:
as a conditional probability and as the probability of logical implication. We have
defined probabilistic inference based on these two interpretations of the “If-Then”
rule and introduced a new variant of Bayes’ theorem based on logical implication [5].

In this paper, rough-set based probabilistic inference is analysed.
There are some interesting correlations between the lower and the upper
approximation, since the lower approximation lies inside the upper
approximation. This restriction between the lower and upper probabilities is discussed
in detail, and the traditional Bayes’ theorem and the new variant of Bayes’
theorem based on logical implication are applied to probabilistic inference
on rough sets.

2 Probabilistic Inference and Bayes Theorem 

2.1 Conditional Probability 

Conditional probability, “how often B happens when A has already (or necessarily)
happened”, deals only with the part of the event space in which A certainly happens.
Thus the sample space changes from U to A.

    $P(A \xrightarrow{c} B) = P(B|A) = P(A \cap B) \div P(A) \quad (a \neq 0)$    (1)

Given $P(A \xrightarrow{c} B) = i_c$ and $P(A) = a$, $P(A \cap B) = i_c \times a$ from Equation (1).
Thus the size of the intersection of A and B is fixed. The possible size of B
ranges from $A \cap B$ to $(A \cap B) + \neg A$. The probabilistic inference based on
the interpretation of the if-then rule as the conditional probability determines $P(B)$
from given $P(A \xrightarrow{c} B)$ and $P(A)$ by the inference style shown in Fig. 2.




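    minimum $P(B) = P(A \cap B) = P(A \xrightarrow{c} B) \times P(A)$
    maximum $P(B) = P(A \cap B) + P(\neg A)$

    $P(A \xrightarrow{c} B) = i_c$
    $P(A) = a$
    ------------------------------------------------
    $P(B) \in [a \times i_c,\ a \times i_c + (1 - a)]$

Fig. 2. Conditional Probability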



Note that $P(B)$ cannot be determined uniquely from $P(A \xrightarrow{c} B)$ and $P(A)$, and is thus
expressed as an interval probability [4]. When the condition $a \times i_c = 1 - a(1 - i_c)$
(thus $a = 1$) holds, $P(B)$ is unique and equal to $P(A \xrightarrow{c} B)$.
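
As a small numeric sketch of the inference in Fig. 2, the interval for P(B) can be computed from the conditional probability and P(A) with exact rational arithmetic. The function name `interval_conditional` is an illustrative choice, not something defined in the paper.

```python
from fractions import Fraction

def interval_conditional(i_c, a):
    """Interval [min P(B), max P(B)] given P(B|A) = i_c and P(A) = a."""
    if not (0 <= i_c <= 1 and 0 < a <= 1):
        raise ValueError("need 0 <= i_c <= 1 and 0 < a <= 1")
    joint = i_c * a                  # P(A ∩ B) is fixed by Equation (1)
    return joint, joint + (1 - a)    # B may additionally cover any part of not-A

# Bounds 2/5 and 9/10 for i_c = 4/5, a = 1/2.
print(interval_conditional(Fraction(4, 5), Fraction(1, 2)))
# The interval collapses to a single value when a = 1.
print(interval_conditional(Fraction(4, 5), Fraction(1)))
```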



2.2 Logical Implication 

The interpretations of $\to$ (implication) in logics such as propositional (binary or Boolean)
logic, multi-valued logic, and fuzzy logic are not unique within each logic. However, the
most common interpretation of $A \to B$ is $\neg A \vee B$.

    $P(A \xrightarrow{l} B) = P(\neg A \cup B)$
    $\qquad = P(A \cap B) + P(\neg A) \quad (P(A) + P(A \xrightarrow{l} B) \geq 1)$    (2)

Since $P(A \cap B) = P(A \xrightarrow{l} B) - P(\neg A)$ from Equation (2), the possible area of B
ranges from $A \cap B$ to $(A \cap B) + \neg A$ $(= A \xrightarrow{l} B)$. The probabilistic inference
based on the interpretation of the if-then rule as the logical implication determines
$P(B)$ as an interval probability from given $P(A \xrightarrow{l} B)$ and $P(A)$, as shown in
Fig. 3.




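    minimum $P(B) = P(A \cap B) = P(A \xrightarrow{l} B) - P(\neg A)$
    maximum $P(B) = P(A \xrightarrow{l} B)$

    $P(A \xrightarrow{l} B) = i_l$
    $P(A) = a$
    ------------------------------------
    $P(B) \in [i_l + a - 1,\ i_l]$

Fig. 3. Logical Implication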



Similar to the conditional probability case, $P(B)$ is unique and equal to $P(A \xrightarrow{l} B)$
when the condition $i_l + a - 1 = i_l$ (thus $a = 1$) holds.
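
The corresponding computation for the logical-implication reading can be sketched as follows. `interval_logical` is a hypothetical helper name, and the feasibility check mirrors the side condition of Equation (2).

```python
from fractions import Fraction

def interval_logical(i_l, a):
    """Interval [min P(B), max P(B)] given P(not A or B) = i_l and P(A) = a."""
    if not (0 <= i_l <= 1 and 0 <= a <= 1):
        raise ValueError("probabilities must lie in [0, 1]")
    if i_l + a < 1:
        raise ValueError("infeasible: P(not A or B) cannot be below 1 - P(A)")
    joint = i_l - (1 - a)     # P(A ∩ B) from Equation (2)
    return joint, i_l         # B can grow from A ∩ B up to (A ∩ B) + not-A

low, high = interval_logical(Fraction(9, 10), Fraction(1, 2))
print(low, high)              # 2/5 9/10
```

With P(A) = 1/2 and P(A ∩ B) = 2/5, the conditional reading gives 4/5 for the rule while the implication reading gives 9/10, yet both readings lead to the same interval for P(B), since both fix the same intersection.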
2.3 Bayes’ Theorem

Bayes’ theorem is widely applied since it is a powerful method for tracing
a cause from its effects. The relationship between the a priori probability $P(A \xrightarrow{c} B)$
and the a posteriori probability $P(B \xrightarrow{c} A)$ is expressed in the following equation by
eliminating $P(A \cap B)$ from the definitions.

    $P(A \xrightarrow{c} B) = P(B|A) = P(A \cap B) \div P(A)$,

    $P(B \xrightarrow{c} A) = P(A|B) = P(A \cap B) \div P(B)$,

    $P(B \xrightarrow{c} A) = P(A \xrightarrow{c} B) \times P(A) \div P(B)$    (3)

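Theorem 2.31 The interpretation of the If-Then rule as the logical implication
satisfies the following equation,

    $P(B \xrightarrow{l} A) = P(A \xrightarrow{l} B) + P(A) - P(B)$    (4)

Proof.

    $P(B \xrightarrow{l} A) = P(\neg B \cup A)$
    $\qquad = P(A \cap B) + P(\neg B)$
    $\qquad = P(A \cap B) + 1 - P(B)$
    $\qquad = P(A \cap B) + 1 - P(A) + P(A) - P(B)$
    $\qquad = P(A \cap B) + P(\neg A) + P(A) - P(B)$
    $\qquad = P(A \xrightarrow{l} B) + P(A) - P(B)$    □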

Note that the new variant of Bayes’ theorem based on logical implication
uses addition ($+$) and subtraction ($-$), where the traditional one uses
multiplication ($\times$) and division ($\div$). This property is attractive for operations on
multiple-valued domains and for simplicity of calculation.
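
Both forms of Bayes’ theorem can be checked numerically on a small finite sample space. The following sketch assumes a uniform probability on ten outcomes; the sets and helper names are illustrative and not taken from the paper.

```python
from fractions import Fraction

U = set(range(10))            # uniform sample space (an illustrative choice)
A = {0, 1, 2, 3, 4}
B = {3, 4, 5, 6, 7, 8}

def prob(E):
    return Fraction(len(E), len(U))

def cond(X, Y):
    # Conditional reading of "if X then Y": P(Y | X) = P(X ∩ Y) / P(X).
    return prob(X & Y) / prob(X)

def impl(X, Y):
    # Logical-implication reading of "if X then Y": P(not X ∪ Y).
    return prob((U - X) | Y)

# Equation (3): multiplicative form for conditional probability.
assert cond(B, A) == cond(A, B) * prob(A) / prob(B)
# Equation (4): additive form for logical implication.
assert impl(B, A) == impl(A, B) + prob(A) - prob(B)

print(cond(A, B), cond(B, A))   # 2/5 1/3
print(impl(A, B), impl(B, A))   # 7/10 3/5
```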



2.4 Bayes’ Inference Based on Conditional Probability 

Now, we apply Bayes’ theorem as the inference rule, and define $P(B \xrightarrow{c} A)$ from
$P(A \xrightarrow{c} B)$, $P(A)$, and $P(B)$. The inference based on the traditional Bayes’
theorem (conditional probability) is shown in Fig. 4(a). From $P(A \xrightarrow{c} B)$ and $P(A)$,
$P(B)$ is determined as an interval probability by the inference rule of Fig. 2 in the
previous discussion. Thus $P(B \xrightarrow{c} A)$ can be determined as in Fig. 4(b) when $P(B)$ is
unknown.

The condition $\max(a + b - 1, 0)/a \leq i_c \leq b/a$ must be satisfied by
the probabilities a, b, and $i_c$, since $i_c = P(A \cap B)/a$ and
$\max(a + b - 1, 0) \leq P(A \cap B) \leq \min(a, b)$.




Probabilistic Inference and Bayesian Theorem on Rough Sets 77 



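    (a)  $P(A \xrightarrow{c} B) = i_c$
         $P(A) = a$
         $P(B) = b$
         ------------------------------------------------
         $P(B \xrightarrow{c} A) = i_c \times a \div b$

    (b)  $P(A \xrightarrow{c} B) = i_c$
         $P(A) = a$
         ------------------------------------------------
         $P(B) \in [i_c \times a,\ i_c \times a + (1 - a)]$
         $P(B \xrightarrow{c} A) \in [i_c \times a \div (i_c \times a + (1 - a)),\ 1]$

Fig. 4. (a) Bayes’ Inference - Conditional (b) P(B) unknown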



2.5 Bayes’ Inference Based on Logical Implication

Similarly, applying the new variant of Bayes’ theorem based on logical
implication, we get the inference rule in Fig. 5. $P(B)$ is determined from
$P(A \xrightarrow{l} B)$ and $P(A)$ by the inference rule of Fig. 3. Thus $P(B \xrightarrow{l} A)$ can be
determined as follows when $P(B)$ is unknown.



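    (a)  $P(A \xrightarrow{l} B) = i_l$
         $P(A) = a$
         $P(B) = b$
         ------------------------------------
         $P(B \xrightarrow{l} A) = i_l + a - b$

    (b)  $P(A \xrightarrow{l} B) = i_l$
         $P(A) = a$
         ------------------------------------
         $P(B) \in [i_l + a - 1,\ i_l]$
         $P(B \xrightarrow{l} A) \in [a,\ 1]$

Fig. 5. (a) Bayes’ Inference - Logical (b) P(B) unknown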



The condition $\max(b, 1 - a) \leq i_l \leq 1 - a + b$ must be satisfied by
the probabilities a, b, and $i_l$, since $i_l = P(A \cap B) + (1 - a)$ and
$\max(a + b - 1, 0) \leq P(A \cap B) \leq \min(a, b)$.
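
When P(B) is unknown, the two posterior intervals of Fig. 4(b) and Fig. 5(b) can be sketched side by side. The helper names are illustrative, and the conditional case additionally assumes that the intersection of A and B has positive probability, so that the quotient is defined over the whole admissible range of P(B).

```python
from fractions import Fraction

def posterior_conditional(i_c, a):
    """Interval for the conditional posterior i_c * a / P(B), P(B) unknown."""
    joint = i_c * a
    if joint <= 0:
        raise ValueError("assumes P(A ∩ B) > 0")
    b_max = joint + (1 - a)              # largest admissible P(B)
    # The smallest admissible P(B) equals joint, which gives posterior 1.
    return joint / b_max, Fraction(1)

def posterior_logical(i_l, a):
    """Interval for the implication posterior i_l + a - P(B), P(B) unknown."""
    if i_l + a < 1:
        raise ValueError("infeasible: need i_l + a >= 1")
    b_min, b_max = i_l + a - 1, i_l      # admissible range of P(B)
    return i_l + a - b_max, i_l + a - b_min

print(posterior_conditional(Fraction(4, 5), Fraction(1, 2)))
print(posterior_logical(Fraction(9, 10), Fraction(1, 2)))
```

In the logical case the lower end always simplifies to P(A) and the upper end to 1, which matches the interval stated in Fig. 5(b).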

3 Probabilistic inference on Rough Sets 

Rough set theory was introduced by Z. Pawlak in 1982 [3], and has been developed as
a mathematical foundation of soft computing. A rough set A is defined by the
upper approximation ${}^*A$ lying outside the lower approximation ${}_*A$, as shown in Fig. 6.




Fig. 6. Rough Set 
Thus the probability of a rough set A is given by a pair of values, $(a_1, a_2)$:
$a_1$ represents the probability of the lower approximation, and $a_2$ represents that
of the upper approximation.

Given the set of all paired values $\mathcal{P}$,

    $\mathcal{P} = \{(a, b) \mid 0 \leq a \leq b \leq 1\}$

the paired probability [4] of “A happens” is $P(A) \in (a_1, a_2)$.



3.1 Inferences on Rough Sets 

In the previous section, given $P(A \to B)$ and $P(A)$, the size of $P(A \cap B)$ is
determined. Thus $P(B)$ must be greater than or equal to $P(A \cap B)$ and less
than or equal to $P(A \cap B) + P(\neg A)$. This condition holds for both $P(A \xrightarrow{c} B)$
and $P(A \xrightarrow{l} B)$. Given $(A \to B)$ and A as rough sets, $P(A \to B)$ and $P(A)$ are
paired probabilities $(i_1, i_2)$ and $(a_1, a_2)$. The possible probabilities of the lower
and upper bounds of B, ${}_*P(B)$ and ${}^*P(B)$, are determined in the same manner
as in the previous discussion.




Fig. 7. Probabilistic Inference on Rough Sets 



However, since there is an inclusion relation between the lower and upper
bounds, they restrict each other under certain conditions. For example, if
${}_*A = {}^*A$, then ${}^*P(B) - {}_*P(B)$ $(= P(Bnd\,B))$ must be greater than or equal to
${}^*P(A \cap B) - {}_*P(A \cap B)$ $(= P(Bnd(A \cap B)))$.




Fig. 8. Probabilistic Inference on Rough Sets - ${}_*A = {}^*A$
Conditional probability $P(A \xrightarrow{c} B)$ is given as $P(A \cap B) \div P(A)$. Thus, we
assume the definition of the lower and upper conditional probabilities as follows,

    ${}_*P(A \xrightarrow{c} B) = \dfrac{{}_*P(A \cap B)}{{}_*P(A)}$,    ${}^*P(A \xrightarrow{c} B) = \dfrac{{}^*P(A \cap B)}{{}^*P(A)}$

Because of the division in the definition of conditional probability,
${}_*P(A \xrightarrow{c} B) \leq {}^*P(A \xrightarrow{c} B)$ is not always true.

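Given $P(A \xrightarrow{c} B) = ({}_*i_c, {}^*i_c)$ and $P(A) = ({}_*a, {}^*a)$, the minimum ${}^*P(B)$ is
determined as ${}^*P(A \cap B) = {}^*i_c \times {}^*a$, and the maximum ${}^*P(B)$ is ${}^*P(A \cap B) +
{}^*P(\neg A) = {}^*i_c \times {}^*a + (1 - {}^*a)$. Since ${}_*P(B) \leq {}^*P(B)$, ${}_*P(B)$ is restricted by the
sizes of $Bnd\,A$ and $Bnd(A \cap B)$.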




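    $P(A \xrightarrow{c} B) = ({}_*i_c, {}^*i_c)$
    $P(A) = ({}_*a, {}^*a)$
    ------------------------------------------------------------
    ${}_*P(B) \in [{}_*a \times {}_*i_c,\ {}_*a \times {}_*i_c + (1 - {}_*a) - \max\{0, Bnd(A) - Bnd(A \cap B)\}]$
    ${}^*P(B) \in [{}^*a \times {}^*i_c,\ {}^*a \times {}^*i_c + (1 - {}^*a)]$

Fig. 9. Rough Set Inference - Conditional Probability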



Similarly, the probability of logical implication $P(A \xrightarrow{l} B)$ is given as
$P(\neg A \cup B) = P(A \cap B) + P(\neg A)$. Thus, we assume the definition of the lower and upper
probabilities based on logical implication as follows,

    ${}_*P(A \xrightarrow{l} B) = {}_*P(A \cap B) + 1 - {}^*P(A)$,    ${}^*P(A \xrightarrow{l} B) = {}^*P(A \cap B) + 1 - {}_*P(A)$

Note that the lower and upper probabilities are not determined independently, since
the definition of $\xrightarrow{l}$ includes negation.

Given $P(A \xrightarrow{l} B) = ({}_*i_l, {}^*i_l)$ and $P(A) = ({}_*a, {}^*a)$, the minimum ${}_*P(B)$ is
determined as ${}^*a - (1 - {}_*i_l)$, and the maximum ${}_*P(B)$ is ${}_*i_l$. ${}^*P(B)$ is restricted
by the size of ${}_*P(B)$ because of the inclusion condition.




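    $P(A \xrightarrow{l} B) = ({}_*i_l, {}^*i_l)$
    $P(A) = ({}_*a, {}^*a)$
    ------------------------------------------------------------
    ${}_*P(B) \in [{}^*a - (1 - {}_*i_l),\ {}_*i_l]$
    ${}^*P(B) \in [\max\{{}_*a - (1 - {}^*i_l),\ \ldots\},\ {}^*i_l]$

Fig. 10. Rough Set Inference - Logical Implication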
3.2 Bayes’ Theorem on Rough Sets

Now, we apply the two types of Bayes’ theorem as rough set inference on paired
probabilities.

Conditional probabilities $P(A \xrightarrow{c} B)$, $P(A)$ and $P(B)$ satisfy Bayes’ theorem.

    ${}_*P(B \xrightarrow{c} A) = {}_*P(A \xrightarrow{c} B) \times {}_*P(A) \div {}_*P(B)$    (5)

    ${}^*P(B \xrightarrow{c} A) = {}^*P(A \xrightarrow{c} B) \times {}^*P(A) \div {}^*P(B)$    (6)

Given $P(A \xrightarrow{c} B) = ({}_*i_c, {}^*i_c)$ and $P(A) = ({}_*a, {}^*a)$ as the paired
probabilities, $P(B)$ is determined as in the previous discussion. Thus, when $P(B)$ is unknown,
$P(B \xrightarrow{c} A)$ is determined as follows.



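    ${}_*P(B \xrightarrow{c} A) \in \left[ \dfrac{{}_*a \times {}_*i_c}{{}_*a \times {}_*i_c + (1 - {}_*a) - \max\{0, Bnd(A) - Bnd(A \cap B)\}},\ 1 \right]$

    ${}^*P(B \xrightarrow{c} A) \in \left[ \dfrac{{}^*a \times {}^*i_c}{{}^*a \times {}^*i_c + (1 - {}^*a)},\ 1 \right]$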



Similarly, the probabilities based on logical implication, $P(A \xrightarrow{l} B)$, $P(A)$ and $P(B)$,
satisfy the new type of Bayes’ theorem.



^-P{bXa) = *V{aXb) +* P(^) V{B) (7) 

*V{bXa) = ^-P{aXb) +♦ P(^) -* P(i^) (8) 

Given $P(A \xrightarrow{l} B) = ({}_*i_l, {}^*i_l)$ and $P(A) = ({}_*a, {}^*a)$ as the paired
probabilities, $P(B)$ is determined as in the previous discussion. Thus, when $P(B)$ is unknown,
$P(B \xrightarrow{l} A)$ is determined as follows.

    ${}_*P(B \xrightarrow{l} A) \in [{}^*a,\ 1 - \max\{0, Bnd(A \xrightarrow{l} B) - Bnd(A)\}]$

    ${}^*P(B \xrightarrow{l} A) \in [{}^*a,\ 1]$

4 Conclusion 

Probabilistic inference on rough sets has been discussed. Given the sizes of $P(A \to B)$
and $P(A)$, the size of $P(A \cap B)$ is calculated under both interpretations of the “If-Then”
rule. Thus $P(B)$ is determined and, applying Bayes’ theorem, $P(B \to A)$ is
also determined. However, in rough set inference, the lower and upper bounds
influence each other through the inclusion relation between them, so the lower and upper
probabilities are not determined independently. This feature is quite unique and
distinguishes rough set inference from interval probability and other approximation
methods. Further work should analyse the mathematical aspects of this
inference and apply it to knowledge discovery and data mining.
Abstract. A fundamental difficulty with fuzzy set theory is the semantical
interpretation of membership functions. We address this issue in
the theory of rough sets. Rough membership functions are viewed as a
special type of fuzzy membership functions interpretable using conditional
probabilities. Rough set approximations are related to the core
and support of a fuzzy set. A salient feature of the interpretation is
that membership values of equivalent or similar elements are related to
each other. Two types of similarities are considered: one is defined by a
partition of the universe, and the other is defined by a covering.



1 Introduction 

The theory of fuzzy sets is a generalization of the classical sets by allowing partial
membership. It provides a more realistic framework for modeling the ill-definition
of the boundary of a class [2, 8]. However, a fundamental difficulty with fuzzy
set theory is the semantical interpretations of the degrees of membership. The 
objectives of this paper are to investigate interpretations of fuzzy membership 
functions in the theory of rough sets, and to establish the connections between 
core and support of fuzzy sets and rough set approximations. 

There are at least two views for interpreting rough set theory [4,6]. The 
operator-oriented view treats rough set theory as an extension of the classical 
set theory. Two additional unary set-theoretic operators are introduced, and the 
meanings of sets and standard set-theoretic operators are unchanged. This view 
is closely related to modal logics. The set-oriented view treats rough set the- 
ory as a deviation of the classical set theory. Sets and set-theoretic operators are 
associated with non-standard interpretations, and no additional set-theoretic op- 
erator is introduced. This view is related to many-valued logics and fuzzy sets. A 
particular set-oriented view is characterized by rough membership functions [3]. 
The formulation and interpretation of rough membership functions are based 
on partitions of a universe. By viewing rough membership functions as a spe- 
cial type of fuzzy membership functions, one may be able to provide a sound 
semantical interpretation of fuzzy membership functions. 

The rest of the paper is organized as follows. In Section 2, we examine the 
interpretation of fuzzy membership functions and show the connections between 
rough set approximations and the core and support of fuzzy sets, based on rough
membership functions defined by a partition of a universe. A unified framework is 
used for studying both fuzzy sets and rough sets. In Section 3, within the estab- 
lished framework, we apply the arguments to a more general case by extending 
partitions to coverings of the universe. Three different rough membership func- 
tions are introduced. They lead to commonly used rough set approximations. 

2 Review and Comparison of Fuzzy Sets and Rough Sets 

In this section, some basic issues of fuzzy sets and rough sets are reviewed, exam- 
ined, and compared by using a unified framework. Fuzzy membership functions 
are interpreted in terms of rough membership functions. The concepts of core 
and support of a fuzzy set are related to rough set approximations. 



2.1 Fuzzy Sets 

The notion of fuzzy sets provides a convenient tool for representing vague
concepts by allowing partial memberships. Let U be a finite and non-empty set
called the universe. A fuzzy subset A of U is defined by a membership function:

    $\mu_A : U \to [0, 1]$.    (1)

There are many definitions for fuzzy set complement, intersection, and union.
The standard min-max system proposed by Zadeh is defined component-wise
by [8]:

    $\mu_{\neg A}(x) = 1 - \mu_A(x)$,
    $\mu_{A \cap B}(x) = \min(\mu_A(x), \mu_B(x))$,
    $\mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x))$.    (2)

In general, one may interpret fuzzy set operators using triangular norms
(t-norms) and conorms (t-conorms) [2]. Let t and s be a pair of t-norm and
t-conorm; we have:

    $\mu_{A \cap B}(x) = t(\mu_A(x), \mu_B(x))$,
    $\mu_{A \cup B}(x) = s(\mu_A(x), \mu_B(x))$.    (3)

A crisp set may be viewed as a degenerate fuzzy set. A pair of t-norm and
t-conorm reduce to standard set intersection and union when applied to crisp
subsets of U. The min is an example of a t-norm and the max is an example of a
t-conorm. An important feature of fuzzy set operators as defined by t-norms and
t-conorms is that they are truth-functional operators. In other words, membership
functions of the complement, intersection, and union of fuzzy sets are defined
based solely on the membership functions of the fuzzy sets involved [6]. Although
they have some desired properties, such as
$\mu_{A \cap B}(x) \leq \min[\mu_A(x), \mu_B(x)] \leq \max[\mu_A(x), \mu_B(x)] \leq \mu_{A \cup B}(x)$,
there is a lack of a well accepted semantical
interpretation of fuzzy set operators.

The concepts of core and support have been introduced and used as
approximations of a fuzzy set [2]. The core of a fuzzy set A is a crisp subset of U
consisting of elements with full membership:

    $core(A) = \{x \in U \mid \mu_A(x) = 1\}$.    (4)

The support is a crisp subset of U consisting of elements with non-zero
membership:

    $support(A) = \{x \in U \mid \mu_A(x) > 0\}$.    (5)

With $1 - (\cdot)$ as fuzzy set complement, and t-norms and t-conorms as fuzzy set
intersection and union, the following properties hold:

    (F1)  $core(A) = \neg(support(\neg A))$,
          $support(A) = \neg(core(\neg A))$,

    (F2)  $core(A \cap B) = core(A) \cap core(B)$,
          $support(A \cap B) \subseteq support(A) \cap support(B)$,

    (F3)  $core(A \cup B) \supseteq core(A) \cup core(B)$,
          $support(A \cup B) = support(A) \cup support(B)$,

    (F4)  $core(A) \subseteq A \subseteq support(A)$.

According to (F1), one may interpret core and support as a pair of dual operators
on the set of all fuzzy sets. They map a fuzzy set to a pair of crisp sets. By
properties (F2) and (F3), one may say that core is distributive over $\cap$ and
support is distributive over $\cup$. However, core is not necessarily distributive over
$\cup$ and support is not necessarily distributive over $\cap$. These two properties follow
from the properties of t-norms and t-conorms. When the min-max system is used,
we have equality in (F2) and (F3). Property (F4) suggests that a fuzzy set lies
between its core and support.
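
The core and support properties can be checked on a small example. The sketch below assumes a four-element universe with hand-picked membership values, and uses the Łukasiewicz t-norm (a standard t-norm, chosen here only for illustration) to show that the support inclusion in (F2) can be proper.

```python
from fractions import Fraction as F

U = ["a", "b", "c", "d"]
A = {"a": F(1), "b": F(3, 5), "c": F(0), "d": F(3, 10)}
B = {"a": F(1), "b": F(1, 5), "c": F(1), "d": F(0)}

def core(mu):
    return {x for x in U if mu[x] == 1}

def support(mu):
    return {x for x in U if mu[x] > 0}

def complement(mu):
    return {x: 1 - mu[x] for x in U}

def combine(mu, nu, op):
    # Apply a t-norm or t-conorm `op` pointwise.
    return {x: op(mu[x], nu[x]) for x in U}

def lukasiewicz(a, b):
    # Lukasiewicz t-norm: max(0, a + b - 1).
    return max(F(0), a + b - 1)

# (F1): core and support are dual under the complement 1 - (.).
assert core(A) == set(U) - support(complement(A))
# (F2) with min: equality for both core and support.
assert core(combine(A, B, min)) == core(A) & core(B)
assert support(combine(A, B, min)) == support(A) & support(B)
# (F2) with the Lukasiewicz t-norm: the support inclusion is proper here.
assert support(combine(A, B, lukasiewicz)) < support(A) & support(B)
# (F3) with max: equality for both core and support.
assert core(combine(A, B, max)) == core(A) | core(B)
assert support(combine(A, B, max)) == support(A) | support(B)
```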



2.2 Rough Sets 

A fundamental concept of rough set theory is indiscernibility. Let $R \subseteq U \times U$
be an equivalence relation on a finite and non-empty universe U. That is, the
relation R is reflexive, symmetric, and transitive. The pair $apr = (U, R)$ is called
an approximation space. The equivalence relation R partitions the universe U
into disjoint subsets called equivalence classes. Elements in the same equivalence
class are said to be indistinguishable. The partition of the universe is referred to
as the quotient set and is denoted by $U/R = \{E_1, \ldots, E_n\}$.

An element $x \in U$ belongs to one and only one equivalence class. Let

    $[x]_R = \{y \mid xRy\}$,    (6)

denote the equivalence class containing x. For a subset $A \subseteq U$, we have the
following well defined rough membership function [3]:

    $\mu_A(x) = \dfrac{|[x]_R \cap A|}{|[x]_R|}$,    (7)


where $|\cdot|$ denotes the cardinality of a set. One can easily see the similarity
between rough membership functions and conditional probabilities. As a matter
of fact, the rough membership value $\mu_A(x)$ may be interpreted as the conditional
probability that an arbitrary element belongs to A given that the element belongs
to $[x]_R$.
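
Equation (7) translates directly into code. In the following minimal sketch a partition is represented as a list of Python sets; the function names and the example data are illustrative assumptions.

```python
from fractions import Fraction

def equivalence_class(partition, x):
    # [x]_R: the unique block of the partition that contains x.
    for block in partition:
        if x in block:
            return block
    raise KeyError(x)

def rough_membership(partition, A, x):
    # Equation (7): |[x]_R ∩ A| / |[x]_R|, a conditional probability
    # under the counting measure.
    block = equivalence_class(partition, x)
    return Fraction(len(block & A), len(block))

partition = [{1, 2, 3}, {4, 5}, {6}]
A = {2, 3, 4, 6}
for x in range(1, 7):
    print(x, rough_membership(partition, A, x))
```

Elements 1, 2 and 3 all receive the same value even though 1 is not in A, because the value depends only on the equivalence class.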

Rough membership functions may be interpreted as a special type of fuzzy
membership functions interpretable in terms of probabilities defined simply by
cardinalities of sets. In general, one may use a probability function on U to define
rough membership functions [7]. One may view the fuzzy set theory as an
uninterpreted mathematical theory of abstract membership functions. The theory
of rough sets thus provides a more specific and more concrete interpretation of
fuzzy membership functions. The source of the fuzziness in describing a concept
is the indiscernibility of elements. In the theoretical development of fuzzy set
theory, fuzzy membership functions are treated as abstract mathematical
functions without any constraint imposed [2]. When we interpret fuzzy membership
functions in the theory of rough sets, we have the following constraints:


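    (rm1)  $\mu_U(x) = 1$,
    (rm2)  $\mu_\emptyset(x) = 0$,
    (rm3)  $y \in [x]_R \Longrightarrow \mu_A(x) = \mu_A(y)$,
    (rm4)  $x \in A \Longrightarrow \mu_A(x) \neq 0$,
    (rm5)  $\mu_A(x) = 1 \Longrightarrow x \in A$,
    (rm6)  $A \subseteq B \Longrightarrow \mu_A(x) \leq \mu_B(x)$.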



Property (rm3) is particularly important. It shows that elements in the same 
equivalence class must have the same degree of membership. That is, indis- 
cernible elements should have the same membership value. Such a constraint, 
which ties the membership values of individual elements according to their con- 
nections, is intuitively appealing. Although this topic has been investigated by 
some authors, there is still a lack of systematic study [1]. Property (rm4) can be
equivalently expressed as $\mu_A(x) = 0 \Longrightarrow x \notin A$, and property (rm5) expressed
as $x \notin A \Longrightarrow \mu_A(x) \neq 1$.

The constraints on rough membership functions have significant implications
on rough set operators. There does not exist a one-to-one relationship between
rough membership functions and subsets of U. Two distinct subsets of U may
define the same rough membership function. Rough membership functions
corresponding to $\neg A$, $A \cap B$, and $A \cup B$ must be defined using set operators and
equation (7). By laws of probability, we have:



    $\mu_{\neg A}(x) = 1 - \mu_A(x)$,




86 Y.Y. Yao and J.P. Zhang 



    $\mu_{A \cup B}(x) = \mu_A(x) + \mu_B(x) - \mu_{A \cap B}(x)$,

    $\max(0, \mu_A(x) + \mu_B(x) - 1) \leq \mu_{A \cap B}(x) \leq \min(\mu_A(x), \mu_B(x))$,
    $\max(\mu_A(x), \mu_B(x)) \leq \mu_{A \cup B}(x) \leq \min(1, \mu_A(x) + \mu_B(x))$.    (8)

Unlike the commonly used fuzzy set operators, the new intersection and union
operators are non-truth-functional. That is, it is impossible to obtain rough
membership functions of $A \cap B$ and $A \cup B$ based solely on the rough
membership functions of A and B. One must also consider their relationships to the
equivalence class $[x]_R$.
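
A concrete counterexample makes the non-truth-functionality visible. The sketch below fixes one equivalence class and three hand-picked subsets (all illustrative assumptions): two candidate sets have the same membership value, yet their intersections with A have different membership values.

```python
from fractions import Fraction

def mu(block, S):
    # Rough membership of any x in `block` with respect to S (equation (7)).
    return Fraction(len(block & S), len(block))

block = {1, 2, 3, 4}          # the equivalence class of some element x
A, B1, B2 = {1, 2}, {1, 2}, {3, 4}

# B1 and B2 have the same membership value 1/2 ...
assert mu(block, B1) == mu(block, B2) == Fraction(1, 2)
# ... yet the intersections with A differ, so no function of the two
# membership values alone can return the membership of the intersection.
print(mu(block, A & B1), mu(block, A & B2))   # 1/2 0

for B in (B1, B2):
    a, b = mu(block, A), mu(block, B)
    both, either = mu(block, A & B), mu(block, A | B)
    assert either == a + b - both                   # probability law for union
    assert max(0, a + b - 1) <= both <= min(a, b)   # bounds on intersection
    assert max(a, b) <= either <= min(1, a + b)     # bounds on union
```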

In an approximation space, a subset $A \subseteq U$ is approximated by a pair of sets
called lower and upper approximations as follows [3]:

    $\underline{apr}(A) = \{x \in U \mid \mu_A(x) = 1\}$
    $\qquad = core(\mu_A)$,

    $\overline{apr}(A) = \{x \in U \mid \mu_A(x) > 0\}$
    $\qquad = support(\mu_A)$.    (9)

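That is, the lower and upper approximations are indeed the core and support of
the fuzzy set $\mu_A$. For any subsets $A, B \subseteq U$, we have:

    (R1)  $\underline{apr}(A) = \neg(\overline{apr}(\neg A))$,
          $\overline{apr}(A) = \neg(\underline{apr}(\neg A))$,

    (R2)  $\underline{apr}(A \cap B) = \underline{apr}(A) \cap \underline{apr}(B)$,
          $\overline{apr}(A \cap B) \subseteq \overline{apr}(A) \cap \overline{apr}(B)$,

    (R3)  $\underline{apr}(A \cup B) \supseteq \underline{apr}(A) \cup \underline{apr}(B)$,
          $\overline{apr}(A \cup B) = \overline{apr}(A) \cup \overline{apr}(B)$,

    (R4)  $\underline{apr}(A) \subseteq A \subseteq \overline{apr}(A)$.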

By comparing with (F1)-(F4), we can see that rough set approximation operators
satisfy the same properties as the core and support of fuzzy sets.

Using the equivalence classes $U/R = \{E_1, \ldots, E_n\}$, lower and upper
approximations can be equivalently defined as follows:

    $\underline{apr}(A) = \bigcup\{E_i \in U/R \mid E_i \subseteq A\}$,

    $\overline{apr}(A) = \bigcup\{E_i \in U/R \mid E_i \cap A \neq \emptyset\}$.    (10)

The lower approximation $\underline{apr}(A)$ is the union of all the equivalence classes which
are subsets of A. The upper approximation $\overline{apr}(A)$ is the union of all the
equivalence classes which have a non-empty intersection with A.
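
The class-based definitions in equation (10) can be sketched with plain Python sets and used to spot-check properties (R1)-(R4) on an example. The partition and the subsets are illustrative, and the asserts verify the properties only for this one instance.

```python
def lower(partition, A):
    # Union of the equivalence classes contained in A.
    return set().union(*(E for E in partition if E <= A))

def upper(partition, A):
    # Union of the equivalence classes that intersect A.
    return set().union(*(E for E in partition if E & A))

partition = [{1, 2, 3}, {4, 5}, {6}]
U = set().union(*partition)
A, B = {1, 4, 5}, {4, 6}

print(lower(partition, A), upper(partition, A))

# (R1): duality of the two operators.
assert lower(partition, A) == U - upper(partition, U - A)
# (R2): lower distributes over intersection, upper only satisfies inclusion.
assert lower(partition, A & B) == lower(partition, A) & lower(partition, B)
assert upper(partition, A & B) <= upper(partition, A) & upper(partition, B)
# (R3): upper distributes over union, lower only satisfies inclusion.
assert upper(partition, A | B) == upper(partition, A) | upper(partition, B)
assert lower(partition, A | B) >= lower(partition, A) | lower(partition, B)
# (R4): A lies between its two approximations.
assert lower(partition, A) <= A <= upper(partition, A)
```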

3 Generalized Rough Membership Functions based on 
Coverings of the Universe 

In a partition, an element belongs to one equivalence class and two distinct 
equivalence classes have no overlap. The rough set theory built on a partition, 
although easy to analyze, may not provide a realistic view of relationships be-
tween elements of the universe. One may consider a more realistic model by 
extending partitions to coverings of the universe [6,9]. 

A covering of the universe, $C = \{C_1, \ldots, C_n\}$, is a family of subsets of U such
that $U = \bigcup\{C_i \mid i = 1, \ldots, n\}$. Two distinct sets in C may have a non-empty
overlap. An arbitrary element x of U may belong to more than one set in C. The
family $C(x) = \{C_i \in C \mid x \in C_i\}$ consists of the sets in C containing x. The sets
in $C(x)$ may describe different types or various degrees of similarity between
elements of U. For a set $C_i \in C(x)$, we may compute a value $|C_i \cap A|/|C_i|$
by extending equation (7). It may be interpreted as the membership value of
x from the viewpoint of $C_i$. From the set $C(x)$, we have a family of values
$\{|C_i \cap A|/|C_i| \mid x \in C_i\}$. Generalized rough membership functions may be defined
by using this family of values. We consider the following three definitions:



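    (minimum)  $\mu^m_A(x) = \min\left\{ \dfrac{|C_i \cap A|}{|C_i|} \;\middle|\; x \in C_i \right\}$,    (11)

    (maximum)  $\mu^M_A(x) = \max\left\{ \dfrac{|C_i \cap A|}{|C_i|} \;\middle|\; x \in C_i \right\}$,    (12)

    (average)  $\mu^*_A(x) = \mathrm{avg}\left\{ \dfrac{|C_i \cap A|}{|C_i|} \;\middle|\; x \in C_i \right\}$.    (13)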



The minimum, maximum, and average definitions may be regarded as the most
pessimistic view, the most optimistic view, and the balanced view in defining rough
membership functions. The minimum rough membership function is determined
by a set in $C(x)$ which has the smallest overlap with A, and the maximum rough
membership function by a set in $C(x)$ which has the largest overlap with A. The
average rough membership function depends on every set in $C(x)$. The three
rough membership functions are related by:

    $\mu^m_A(x) \leq \mu^*_A(x) \leq \mu^M_A(x)$.    (14)

A partition is a special type of covering. In this case, the three rough membership
functions reduce to the same rough membership function.
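
The three generalized membership functions can be sketched as follows, with a covering represented as a list of Python sets. The covering, the subset A, and the function names are illustrative assumptions.

```python
from fractions import Fraction

def ratios(cover, A, x):
    # |C_i ∩ A| / |C_i| for every set C_i of the covering that contains x.
    return [Fraction(len(C & A), len(C)) for C in cover if x in C]

def mu_min(cover, A, x):
    return min(ratios(cover, A, x))

def mu_max(cover, A, x):
    return max(ratios(cover, A, x))

def mu_avg(cover, A, x):
    r = ratios(cover, A, x)
    return sum(r) / len(r)

cover = [{1, 2}, {2, 3, 4}, {4, 5}]     # overlapping sets whose union is U
U = set().union(*cover)
A = {1, 2, 3}

for x in sorted(U):
    lo, av, hi = mu_min(cover, A, x), mu_avg(cover, A, x), mu_max(cover, A, x)
    assert lo <= av <= hi               # relation (14)
    print(x, lo, av, hi)
```

For element 2, which lies in two sets of the covering, the three values differ; for a partition every element lies in exactly one set, so the three functions coincide.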

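The generalized rough membership functions have the following properties:

    (grm1)  $\mu^m_U(x) = \mu^*_U(x) = \mu^M_U(x) = 1$,

    (grm2)  $\mu^m_\emptyset(x) = \mu^*_\emptyset(x) = \mu^M_\emptyset(x) = 0$,

    (grm3)  $[\forall C_i \in C\,(x \in C_i \Longleftrightarrow y \in C_i)] \Longrightarrow [\mu^m_A(x) = \mu^m_A(y),\ \mu^*_A(x) = \mu^*_A(y),\ \mu^M_A(x) = \mu^M_A(y)]$,

    (grm4)  $x, y \in C_i \Longrightarrow [(\mu^M_A(x) \neq 1 \Longrightarrow \mu^m_A(y) \neq 1),\ (\mu^m_A(x) \neq 0 \Longrightarrow \mu^M_A(y) \neq 0)]$,

    (grm5)  $x \in A \Longrightarrow \mu^m_A(x) \neq 0$,

    (grm6)  $\mu^M_A(x) = 1 \Longrightarrow x \in A$,

    (grm7)  $A \subseteq B \Longrightarrow [\mu^m_A(x) \leq \mu^m_B(x),\ \mu^*_A(x) \leq \mu^*_B(x),\ \mu^M_A(x) \leq \mu^M_B(x)]$.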

Both (grm3) and (grm4) show the constraints on rough membership functions
imposed by the similarity of objects. From the relation
$\mu^m_A(x) \leq \mu^*_A(x) \leq \mu^M_A(x)$,
we can obtain additional properties. For example, (grm5) implies that
$x \in A \Longrightarrow \mu^*_A(x) \neq 0$ and $x \in A \Longrightarrow \mu^M_A(x) \neq 0$. Similarly, (grm6) implies that
$\mu^*_A(x) = 1 \Longrightarrow x \in A$ and $\mu^m_A(x) = 1 \Longrightarrow x \in A$.

For set-theoretic operators, one can verify the following properties:

= 1 - 

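    $\max(0, \mu^m_A(x) + \mu^m_B(x) - \mu^M_{A \cup B}(x)) \leq \mu^m_{A \cap B}(x) \leq \min(\mu^m_A(x), \mu^m_B(x))$,
    $\max(\mu^M_A(x), \mu^M_B(x)) \leq \mu^M_{A \cup B}(x) \leq \min(1, \mu^M_A(x) + \mu^M_B(x) - \mu^m_{A \cap B}(x))$,

    $\mu^*_{A \cap B}(x) = \mu^*_A(x) + \mu^*_B(x) - \mu^*_{A \cup B}(x)$.    (15)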

We again obtain non-truth-functional rough set operators. 

The minimum rough membership function may be viewed as the lower bound
on all possible rough membership functions definable using a covering, while the
maximum rough membership function may be viewed as the upper bound. The pair
$(\mu^m_A(x), \mu^M_A(x))$ may also be used to define an interval-valued fuzzy set [2].
The interval $[\mu^m_A(x), \mu^M_A(x)]$
is the membership value of x with respect to A.

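From the three rough membership functions, we define three pairs of lower
and upper approximations. For the minimum definition, we have:

    $\underline{apr}^m(A) = core(\mu^m_A)$
    $\qquad = \{x \in U \mid \mu^m_A(x) = 1\}$
    $\qquad = \{x \in U \mid \forall C_i \in C\,(x \in C_i \Longrightarrow C_i \subseteq A)\}$,

    $\overline{apr}^m(A) = support(\mu^m_A)$
    $\qquad = \{x \in U \mid \mu^m_A(x) > 0\}$
    $\qquad = \{x \in U \mid \forall C_i \in C\,(x \in C_i \Longrightarrow C_i \cap A \neq \emptyset)\}$.    (16)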

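For the maximum definition, we have:

    $\underline{apr}^M(A) = core(\mu^M_A)$
    $\qquad = \{x \in U \mid \mu^M_A(x) = 1\}$
    $\qquad = \{x \in U \mid \exists C_i \in C\,(x \in C_i,\ C_i \subseteq A)\}$
    $\qquad = \bigcup\{C_i \in C \mid C_i \subseteq A\}$,

    $\overline{apr}^M(A) = support(\mu^M_A)$
    $\qquad = \{x \in U \mid \mu^M_A(x) > 0\}$
    $\qquad = \{x \in U \mid \exists C_i \in C\,(x \in C_i,\ C_i \cap A \neq \emptyset)\}$
    $\qquad = \bigcup\{C_i \in C \mid C_i \cap A \neq \emptyset\}$.    (17)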

The lower and upper approximations in each pair are no longer dual operators.
However, $(\underline{apr}^m, \overline{apr}^M)$ and $(\underline{apr}^M, \overline{apr}^m)$ are two pairs of dual operators. The
first pair can be derived from the average definition, namely:

    $\underline{apr}^*(A) = \underline{apr}^m(A)$,    $\overline{apr}^*(A) = \overline{apr}^M(A)$.    (18)
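
The four covering-based operators and their duality can be checked on a small example. In this sketch the suffix `_m` marks operators obtained from the minimum membership function and `_M` those from the maximum one; the covering and the subset are illustrative assumptions, and every element of U is assumed to lie in at least one set of the covering.

```python
def lower_m(cover, U, A):
    # Every covering set that contains x is contained in A.
    return {x for x in U if all(C <= A for C in cover if x in C)}

def upper_m(cover, U, A):
    # Every covering set that contains x intersects A.
    return {x for x in U if all(C & A for C in cover if x in C)}

def lower_M(cover, U, A):
    # Some covering set that contains x is contained in A.
    return {x for x in U if any(C <= A for C in cover if x in C)}

def upper_M(cover, U, A):
    # Some covering set that contains x intersects A.
    return {x for x in U if any(C & A for C in cover if x in C)}

cover = [{1, 2}, {2, 3, 4}, {4, 5}]
U = set().union(*cover)
A = {1, 2, 3}

print(lower_m(cover, U, A), upper_m(cover, U, A))
print(lower_M(cover, U, A), upper_M(cover, U, A))

# The mixed pairs are dual ...
assert lower_m(cover, U, A) == U - upper_M(cover, U, U - A)
assert lower_M(cover, U, A) == U - upper_m(cover, U, U - A)
# ... while the lower and upper operators of the same definition are not.
assert lower_m(cover, U, A) != U - upper_m(cover, U, U - A)
```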
These approximation operators have been studied extensively in rough set theory.
Their connections and properties can be found in a recent paper by Yao [5].

4 Conclusion 

Rough membership functions can be viewed as a special type of fuzzy member- 
ship functions, and rough set approximations as the core and support of fuzzy 
sets. This provides a starting point for the interpretation of fuzzy membership 
functions in the theory of rough sets. We study rough membership functions 
defined based on partitions and coverings of a universe. 

The formulation and interpretation of rough membership functions are in- 
separable parts of the theory of rough sets. Each rough membership function 
has a well defined semantical interpretation. The source of uncertainty modeled 
by rough membership functions is the indiscernibility or similarity of objects. 
Constraints are imposed on rough membership functions by the relationships 
between objects. More specifically, equivalent objects must have the same mem- 
bership value, and similar objects must have similar membership values. These 
observations may have significant implications for the understanding of fuzzy 
set theory. The interpretation of fuzzy membership functions in the theory of 
rough sets provides a more restrictive, but more concrete, view of fuzzy sets. 
Such semantically sound models may provide possible solutions to the funda- 
mental difficulty with fuzzy set theory regarding semantical interpretations of 
fuzzy membership functions. 
Abstract. This paper presents an approach to visualizing a system’s dynamic
performance with rough performance maps. Derived from the Julia set 
common in the visualization of iterative chaos, performance maps are 
constructed using rule evaluation and require a minimum of a priori knowledge 
of the system under consideration. By the use of carefully selected performance 
evaluation rules combined with color-coding, they convey a wealth of 
information to the informed user about dynamic behaviors of a system that may 
be hidden from all but the expert analyst. A rough set approach is employed to 
generate an approximation of a performance map. Generation of this new 
rough performance map allows more intuitive rule derivation and requires 
fewer system parameters to be observed. 



1 Introduction 

An approach to visualizing control and other dynamical systems with performance 
maps based on rough set theory is introduced in this paper. A performance map is 
derived from the Julia set method common in the study of chaotic equations [1]. Julia 
sets are fractals with shapes that are generated by iteration of simple complex 
mappings. The term fractal was introduced by Mandelbrot in 1982 to denote sets with 
fractured structure [2]. The critical component in a rough performance map is the use 
of rough set theory [3] in the generation of rules used to evaluate system performance 
in color-coding system responses. Performance map rules are derived using the 
method found in [4]. The images resulting from the application of derived rules yield 
very close approximations of full performance maps. Such maps are called rough 
performance maps. The contribution of this paper is the application of rough sets in 
the creation of performance maps in visualizing the performance of dynamical 
systems. 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 90-97, 2001. 
2 Generation of a Performance Map 

It has been suggested that the Julia set method can be adapted to visualize control 
and other dynamic systems [4]. Such a technique addresses the problem of 
visualizing automatically the state values of some system as its parameters are varied. 
In effect, the method produces a color-coded performance map (PM), which reflects 
the dynamic performance of the system across intervals of system parameters. As 
with the Julia set, a PM is generated via digital computation, using rules appropriate 
to the system for which dynamic behavior is being visualized. The generation of a 
PM proceeds in a manner similar to a Julia set, namely, with a pixel mapping and 
color scheme selection. The pixel mapping for a PM reflects the problem domain, i.e. 
a pair of system parameters. Intervals of system parameters, say parameters a and b, 
are mapped to a computer screen that now represents the parametric plane of the 
system. For example, intervals of parameters a and b can be mapped to the x- and 
y-axis respectively, so each pixel represents a unique value of (a,b). A 
performance rule is used to evaluate the condition of selected state variables of a 
system during numerical integration. Such rules fall into two general types, fire/no 
fire and qualifying rules. A fire/no fire rule will fire if a programmed trigger 
condition is encountered during the simulation whilst a qualifying rule is used to place 
a qualifying value on an entire system trajectory following a period of integration. 
Performance rules are typically aggregated to test and detect multiple behaviors, but 
in any case, their generation is normally intuitive following observations of the 
dynamic behavior of a system. 
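
The pixel mapping and rule evaluation described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the first-order system dx/dt = -a*x + b, the thresholds, and the names `linspace`, `toy_rules` and `performance_map` are hypothetical and are not taken from [4]. One fire/no fire rule and one qualifying rule are aggregated into a single integer code per pixel.

```python
from typing import Callable, List, Tuple


def linspace(lo: float, hi: float, n: int) -> List[float]:
    """Return n evenly spaced values from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def toy_rules(a: float, b: float, dt: float = 0.01, t_end: float = 5.0) -> int:
    """Rule code for the hypothetical system dx/dt = -a*x + b, x(0) = 0.

    bit 0: fire/no fire rule, fires if x exceeds 1.0 at any integration step.
    bit 1: qualifying rule, judged once on the finished trajectory: set if the
           final state is still more than 5% away from the steady state b/a.
    """
    x, fired = 0.0, False
    for _ in range(int(round(t_end / dt))):
        x += dt * (-a * x + b)  # forward Euler integration step
        if x > 1.0:
            fired = True
    not_settled = abs(x - b / a) > 0.05 * abs(b / a)
    return int(fired) | (int(not_settled) << 1)


def performance_map(rules: Callable[[float, float], int],
                    a_range: Tuple[float, float],
                    b_range: Tuple[float, float],
                    width: int, height: int) -> List[List[int]]:
    """Map pixel (x, y) to a parameter pair (a, b) and store its rule code."""
    a_vals = linspace(a_range[0], a_range[1], width)   # x-axis
    b_vals = linspace(b_range[0], b_range[1], height)  # y-axis
    return [[rules(a, b) for a in a_vals] for b in b_vals]


if __name__ == "__main__":
    pm = performance_map(toy_rules, (0.5, 2.0), (0.5, 2.0), 4, 4)
    for row in reversed(pm):  # print with the b axis pointing upwards
        print(" ".join(str(code) for code in row))
```

Each integer code would then be translated into a pixel color; code 0 (no rule fired) plays the role of the acceptable region.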



3 Simulation for Linear Plant 



A performance map can be generated for any system provided that the problem 
domain can be mapped to the computer screen and that appropriate performance rules 
can be derived. In order to demonstrate the utility of this technique, a performance 
map is now developed for the readily verified second-order, linear, continuous system 
in (1). 

$$\frac{Y(s)}{X(s)} = \frac{\omega_n^2}{s^2 + 2\zeta\omega_n s + \omega_n^2} \qquad (1)$$

where $\zeta$ = damping ratio and $\omega_n$ = natural undamped frequency. 



3.1 Generation of Performance Rules 

Four qualitative measurements that can be evaluated by fire/no-fire rules, namely, 
maximum overshoot, $O_v$, delay time, $t_d$, rise time, $t_r$, and settle time, $t_s$, have been 
chosen. While the tuning of parameters $\zeta$ and $\omega_n$ via classical techniques can set 
these measurements, it would be most beneficial to be able to assess the parameters to 
set all measurements at once to meet tuning criteria. To develop an appropriate rule 
set, it is first necessary to determine the required performance of the system. The rule 
set will then be used to determine which sets of parameter values $[\omega_n, \zeta]$ affect this 
performance. The tuning of the system in terms of maximum overshoot, $O_v$, delay 
time, $t_d$, rise time, $t_r$, and settle time, $t_s$, is somewhat arbitrary. For the purposes 
of illustration, the required performance criteria of the system are selected to be: 
maximum overshoot: $O_v = 0$, maximum delay: $t_d = 0.038$ s, maximum rise time: 
$t_r = 0.068$ s and maximum settle time: $t_s = 0.095$ s. 

The test for maximum overshoot is the most straightforward of the four 
measurements. This rule will trigger if the magnitude of the system solution y(t) > 1 
at any time during integration, indicating that an overshoot has occurred, i.e. $O_v > 0$. 

The test for delay time, $t_d$, is equally simple, requiring that the magnitude of the 
system solution be tested at the selected time, $t_d = 0.038$ s. This test need only be 
performed once. 

The measurement of rise time, $t_r$, does not begin until the system solution has 
reached 10% of its final value. In effect, reaching this value serves to prime the rise 
time rule for trigger detection. Once the rule is primed, the delay timer is 
initiated and the magnitude of the system solution is tested at $t_r = 0.068$ s. This test is also 
performed once. Settle time, $t_s$, is tested in a fashion similar to delay time. Provided 
the system reaches the programmed settle time, the system solution is tested to 
determine if it is within 5% of its final value. In order to determine if the system 
remains within this boundary, this condition is tested following each integration step 
until the maximum integration time period has elapsed. 
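
The four tests above can be sketched as a single integration loop. This is a minimal sketch, not the authors' software: the 50% level for the delay test and the 90% level for the rise test are assumptions (the text gives only the test times), the semi-implicit Euler integrator and the step size are arbitrary choices, and the function name `evaluate_rules` is hypothetical.

```python
T_D, T_R, T_S = 0.038, 0.068, 0.095  # limits from the text, in seconds


def evaluate_rules(omega_n, zeta, t_end=0.5, dt=1e-5):
    """Return (overshoot, delay, rise, settle); True means the rule fired."""
    y = v = t = 0.0
    overshoot = delay = rise = settle = False
    delay_tested = rise_tested = False
    primed_at = None  # time at which the rise-time rule was primed
    for _ in range(int(round(t_end / dt))):
        # y'' + 2*zeta*omega_n*y' + omega_n^2*y = omega_n^2 (unit step input)
        v += dt * (omega_n ** 2 * (1.0 - y) - 2.0 * zeta * omega_n * v)
        y += dt * v
        t += dt
        if y > 1.0 + 1e-9:  # small tolerance guards against rounding noise
            overshoot = True
        if not delay_tested and t >= T_D:
            # Assumed criterion: 50% of the final value reached by t_d.
            delay_tested, delay = True, y < 0.5
        if primed_at is None and y >= 0.1:
            primed_at = t  # 10% of the final value reached: prime the rule
        if primed_at is not None and not rise_tested and t >= primed_at + T_R:
            # Assumed criterion: 90% of the final value reached t_r after priming.
            rise_tested, rise = True, y < 0.9
        if t >= T_S and abs(y - 1.0) > 0.05:
            settle = True  # outside the 5% band after the settle time
    # A test that was never reached counts as a fired rule.
    return overshoot, delay or not delay_tested, rise or not rise_tested, settle


if __name__ == "__main__":
    print(evaluate_rules(148.173, 1.429))  # parameters quoted for Fig. 4
    print(evaluate_rules(11.296, 0.339))   # parameters quoted for Fig. 5
```

Under these assumed thresholds, the parameter pair quoted for Fig. 4 fires no rule and the pair quoted for Fig. 5 fires all four, which agrees with the black and white regions described in the validation section.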



3.2 Theoretical Results 

Obviously, when the system is stable and damped, the maximum overshoot of the 
second-order, linear system occurs on the first overshoot, for a unit step input, at a 
time, $t_{\max}$, which is given by (2). 

$$t_{\max} = \frac{\pi}{\omega_n \sqrt{1 - \zeta^2}} \qquad (2)$$

The system is critically damped with zero overshoot when $\zeta = 1$, and will exhibit 
an overshoot whenever $\zeta < 1$. The rise time $t_r$ and delay 
time $t_d$ of the system may be approximated (see Fig. 1) as can the settle time $t_s$ (see 
Fig. 2) [7],[8],[9]. 
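
As a numerical check of (2), the standard closed-form unit-step response of an underdamped second-order system can be scanned for its maximum. The sketch below is illustrative only; the function names are hypothetical, the sample values are the ones quoted for Fig. 5, and the grid resolution is an arbitrary choice.

```python
import math


def step_response(t, omega_n, zeta):
    """Unit-step response of the second-order system for 0 < zeta < 1."""
    root = math.sqrt(1.0 - zeta ** 2)
    wd = omega_n * root  # damped natural frequency
    return 1.0 - math.exp(-zeta * omega_n * t) * (
        math.cos(wd * t) + (zeta / root) * math.sin(wd * t))


def peak_time(omega_n, zeta):
    """Time of the first (and largest) overshoot, as given by (2)."""
    return math.pi / (omega_n * math.sqrt(1.0 - zeta ** 2))


def numeric_peak_time(omega_n, zeta, t_end=1.0, dt=1e-4):
    """Grid search for the time at which the step response is largest."""
    times = [i * dt for i in range(int(round(t_end / dt)) + 1)]
    return max(times, key=lambda t: step_response(t, omega_n, zeta))


if __name__ == "__main__":
    wn, z = 11.296, 0.339
    print(peak_time(wn, z), numeric_peak_time(wn, z))
```

At the peak the response equals $1 + e^{-\pi\zeta/\sqrt{1-\zeta^2}}$, so the overshoot vanishes only in the limit $\zeta \to 1$.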
Fig. 1. $\omega_n$ vs. $\zeta$ for $t_d$ = 0.038 and $t_r$ = 0.068 

Fig. 2. Graph of $\omega_n$ vs. $\zeta$ for $t_s$ = 0.095. 



3.3 Sample Performance Map 



A reasonable expectation of the performance map is that it represents an 
amalgamation of these graphs, as well as a sharp division at $\zeta = 1.0$ due to overshoot. 
This is indeed the case as shown by the performance map of Fig. 3. 







Fig. 3. Performance map of 2nd order linear system 

The color key associated with the map shows the pixel colors employed to 
represent the rules that did and did not fire. In the key, a “1” means that a rule fired, 
indicating that required performance was not achieved, whilst a “0” indicates that the 
rule did not fire, and the performance was acceptable for that measurement. The 
color black, comprising the majority of the upper right quadrant of the map, is used 
where no rules fired, thus giving an immediate and intuitive visualization of the 
parameter values, $[\omega_n, \zeta]$, that effect the required system response. Thus, the 
selection of any pixel within this black region will yield a parameter set that not only 
provides stable operation, but also operation which is finely tuned for desired 
performance. 



3.4 Validation of Results 

The software written to generate the performance map allows the user to select a 
single pair of parameter values from the map by “clicking” on the image with a 
computer mouse. This allows the system dynamics to be further explored for these 
parameters such as by using time graphs, phase portraits or Poincaré maps. A region 
of the map can also be selected, and a new visualization generated for that set of 
parameter intervals, effectively rendering a close-up view of a map region. In order 
to validate the information portrayed by the performance map, time graphs of the 
system response are considered for parameter sets selected at random from the black 
and colored regions of the map. Fig. 4 contains a time graph generated for a set of 
parameters, $[\omega_n, \zeta]$, selected at random from the black region of the map. As stated 
above, this region reflects the parameters for which the system exhibits a stable, 
acceptable response. The time graph clearly shows that the plant action is well within 
the required specifications, i.e. no overshoot, settle time within 0.095 s, and so on. 

Fig. 5 contains a time graph generated for a set of parameters, $[\omega_n, \zeta]$, selected 
from the white region of the map, where all rules fired, indicating that the maximum 
overshoot, $O_v$, delay time, $t_d$, rise time, $t_r$, and settle time, $t_s$, all exceed the limits 
required. This graph clearly shows that the system response fails to meet these 
requirements, further validating the performance map technique. 





Fig. 4. Time graph of second-order linear system, with $\omega_n$ = 148.173 and $\zeta$ = 1.429. 

Fig. 5. Time graph of second-order linear system, with $\omega_n$ = 11.296 and $\zeta$ = 0.339. 



4 Approximation of Performance Maps 

The computational overhead in the generation of a performance map is significant, 
with a map often requiring as many as 9x10^ numerical integration steps [4]. In 
addition, it is possible that some state values required in the formulation of 
performance rules can be unobservable, negating their value in system evaluation. It 
is thus desirable to develop a means whereby the number of variables required to test a 
system’s performance is reduced. To accomplish this reduction, a rough set approach to rule 
derivation is considered [4], [10], [11]. The use of rough rules in the generation of a 
performance map effectively yields a rough performance map, this being an 
approximation of the full map (see Table 1). 



Table 1. Decision Table and Derived Rules 



Ov        Td        Tr        Ts        D
float(3)  float(3)  float(3)  float(3)
0.000     0.038     0.068     0.095     0
0.050     0.045     0.070     0.090     1
0.050     0.040     0.075     0.090     1
0.050     0.040     0.070     0.095     1
0.100     0.040     0.070     0.090     2
0.100     0.050     0.080     0.100     3
0.100     0.050     0.080     0.110     4
0.100     0.050     0.090     0.100     3
0.150     0.050     0.070     0.090     2
0.150     0.060     0.080     0.100     3
0.200     0.040     0.070     0.090     2
0.200     0.050     0.080     0.100     4
0.200     0.040     0.100     0.100     4

Derived Rules

Ov(0.000) AND Ts(0.095) -> D(0) 
Ov(0.050) AND Ts(0.090) -> D(1) 
Ov(0.050) AND Ts(0.095) -> D(1) 
Ov(0.100) AND Ts(0.090) -> D(2) 
Ov(0.100) AND Ts(0.100) -> D(3) 
Ov(0.100) AND Ts(0.110) -> D(4) 
Ov(0.150) AND Ts(0.090) -> D(2) 
Ov(0.150) AND Ts(0.100) -> D(3) 
Ov(0.200) AND Ts(0.090) -> D(2) 
Ov(0.200) AND Ts(0.100) -> D(4) 




Each tuple in Table 1 represents a set of parameter values and a decision value. 
This table is filled in an approximate manner by assigning a simple scale to reflect the 
levels of system performance. The scale [0, 4] contains decision values, 
where 0 (zero) implies excellent performance and 4 implies very poor system 
performance. An excellent performance is defined to be the required performance 
level applied above, namely, maximum overshoot: $O_v = 0$, maximum delay: 
$t_d = 0.038$ s, maximum rise time: $t_r = 0.068$ s and maximum settle time: $t_s = 0.095$ s. 

The remaining performance levels were applied using entirely verbal descriptions. A 
set of reducts was generated next, a process that was entirely automated 
through the use of the Rosetta software package [12]. The result of the reduct 
generation was that the measurements of delay time $t_d$ and rise time $t_r$ were redundant 
measurements and were deleted as unnecessary knowledge. The reduct then consists of 
maximum overshoot and settle time, $\{O_v, t_s\}$. Finally the set of rough rules shown in 
Table 1 was generated, which is easily programmed using IF-THEN-ELSE constructs 
for the generation of a rough performance map. 
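
The reduct step can be illustrated with a brute-force search over attribute subsets of Table 1. This is only a sketch of the general idea (find minimal attribute sets that keep the decision table consistent) and is not the algorithm used by Rosetta; the names `is_consistent` and `reducts` are hypothetical, and the decision value of the first row is taken to be 0, as the rule Ov(0.000) AND Ts(0.095) -> D(0) indicates.

```python
from itertools import combinations

ATTRS = ("Ov", "Td", "Tr", "Ts")

# Rows of Table 1: (Ov, Td, Tr, Ts, D).
TABLE = [
    (0.000, 0.038, 0.068, 0.095, 0),
    (0.050, 0.045, 0.070, 0.090, 1),
    (0.050, 0.040, 0.075, 0.090, 1),
    (0.050, 0.040, 0.070, 0.095, 1),
    (0.100, 0.040, 0.070, 0.090, 2),
    (0.100, 0.050, 0.080, 0.100, 3),
    (0.100, 0.050, 0.080, 0.110, 4),
    (0.100, 0.050, 0.090, 0.100, 3),
    (0.150, 0.050, 0.070, 0.090, 2),
    (0.150, 0.060, 0.080, 0.100, 3),
    (0.200, 0.040, 0.070, 0.090, 2),
    (0.200, 0.050, 0.080, 0.100, 4),
    (0.200, 0.040, 0.100, 0.100, 4),
]


def is_consistent(cols):
    """True if rows that agree on the chosen columns also agree on D."""
    seen = {}
    for row in TABLE:
        key = tuple(row[c] for c in cols)
        if seen.setdefault(key, row[-1]) != row[-1]:
            return False
    return True


def reducts():
    """Minimal consistent attribute subsets, found by exhaustive search."""
    found = []
    for k in range(1, len(ATTRS) + 1):
        for cols in combinations(range(len(ATTRS)), k):
            is_superset = any(set(r) <= set(cols) for r in found)
            if not is_superset and is_consistent(cols):
                found.append(cols)
    return [tuple(ATTRS[c] for c in r) for r in found]


if __name__ == "__main__":
    print(reducts())
```

For this table the search returns the single attribute set (Ov, Ts), in agreement with the reduct reported in the text.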



5 Simulation of Linear Plant 

The derived rules in Table 1 are applied to the second-order linear plant described 
by (1). Derived rules lead to a “rough” performance map, which is an approximation 
of the full performance map shown in Fig. 3. Being an approximation, there is 
necessarily some loss of information; however, this loss is acceptable if the ultimate 
goal of the visualization is to represent the parameter values that yield the required 
performance. Fig. 6 contains the rough performance map generated for the second- 
order linear system. The black region of the map, which comprises the majority of 
the upper right quadrant, reflects the parameter values, $[\omega_n, \zeta]$, that yield the 
optimum system response. 
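
The rough rules of Table 1 can be coded as an ordered rule list, which is equivalent to an IF-THEN-ELSE chain. In the sketch below the rule values are read as upper limits and the first matching rule wins; both choices, and the name `rough_decision`, are assumptions made for illustration.

```python
from typing import Optional

# (Ov limit, Ts limit, decision D), ordered so that the first match wins.
ROUGH_RULES = [
    (0.000, 0.095, 0),
    (0.050, 0.090, 1), (0.050, 0.095, 1),
    (0.100, 0.090, 2), (0.150, 0.090, 2), (0.200, 0.090, 2),
    (0.100, 0.100, 3), (0.150, 0.100, 3),
    (0.100, 0.110, 4), (0.200, 0.100, 4),
]


def rough_decision(ov: float, ts: float) -> Optional[int]:
    """Return the decision of the first rule satisfied by (ov, ts).

    None means that no rule applies (the "other" color of the map).
    """
    for ov_max, ts_max, decision in ROUGH_RULES:
        if ov <= ov_max and ts <= ts_max:
            return decision
    return None


if __name__ == "__main__":
    print(rough_decision(0.0, 0.095))  # excellent performance
    print(rough_decision(0.3, 0.2))    # no rule applies
```

Only the overshoot and settle time need to be measured per pixel, which is where the reduction in observed quantities comes from.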




Color key of the rough rules: 

  Ov <= 0.00 AND Ts <= 0.095 
  Ov <= 0.05 AND Ts <= 0.09 
  Ov <= 0.05 AND Ts <= 0.095 
  Ov <= 0.10 AND Ts <= 0.09 
  Ov <= 0.15 AND Ts <= 0.09 
  Ov <= 0.20 AND Ts <= 0.09 
  Ov <= 0.10 AND Ts <= 0.10 
  Ov <= 0.15 AND Ts <= 0.10 
  Ov <= 0.10 AND Ts <= 0.11 
  Ov <= 0.20 AND Ts <= 0.10 
  Other 

Fig. 6. Rough performance map of the 2nd order linear system 



The fractal capacity dimension, also known as the box counting dimension, is 
based on Shannon’s information theory [13] and provides a geometric measurement 
of an orbit in state space. The capacity dimension is based on a measurement of the 
number of boxes or cells, of size $\varepsilon$, which are required to fully cover a system 
trajectory on a phase plane. The capacity dimension model [14] is given in (3). 

$$D_c = \lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log(1/\varepsilon)} \qquad (3)$$

where $N(\varepsilon)$ = number of cells required to fully cover the trajectory on the phase plane 
and $\varepsilon$ = size of each cell (a cell is of size $\varepsilon \times \varepsilon$). The capacity dimension is used to quantify the 
black region of a performance map [4]. Let the map size be $n \times n$ pixels and let the 
cell size be one pixel. The size of each cell is $\varepsilon = 1/n$. Rather than count the number 
of cells required to cover a trajectory, count instead the number of pixels colored 
black. Thus, the capacity dimension of a rough or full performance map 
becomes (4) 

$$D_c = \lim_{\varepsilon \to 0} \frac{\log N(P_b)}{\log(1/\varepsilon)} = \frac{\log N(P_b)}{\log n} \qquad (4)$$

where $N(P_b)$ = count of black pixels, and $n$ = width of the performance map in pixels. 
The fractal or fractional capacity dimension, $D_c$, is employed to validate the rough 
performance map. Using (4), the capacity dimension for the full performance map of Fig. 3 is 
1.788, while the dimension for the rough performance map of Fig. 6 is 1.794. This 
provides a validation that the region reflecting rough ruled optimum system 
performance does indeed provide a good approximation of the parameters $[\omega_n, \zeta]$ that 
affect the required system response. 
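
Equation (4) reduces to a pixel count, as the following sketch shows. It assumes a square map stored as a list of rows of integer rule codes, with code 0 standing for black (no rule fired); this encoding and the name `capacity_dimension` are illustrative assumptions.

```python
import math

BLACK = 0  # assumed rule code for "no rule fired"


def capacity_dimension(pixels):
    """Capacity dimension log N(P_b) / log n of an n x n map, n > 1."""
    n = len(pixels)
    if n < 2 or any(len(row) != n for row in pixels):
        raise ValueError("expected a square map with n > 1")
    black = sum(row.count(BLACK) for row in pixels)
    if black == 0:
        raise ValueError("the map contains no black pixels")
    return math.log(black) / math.log(n)


if __name__ == "__main__":
    n = 64
    full = [[BLACK] * n for _ in range(n)]            # entirely black map
    line = [[BLACK] * n] + [[1] * n for _ in range(n - 1)]  # one black row
    print(capacity_dimension(full), capacity_dimension(line))
```

An entirely black map has dimension 2 and a single black row has dimension 1, so values such as 1.788 and 1.794 indicate black regions that cover a large but incomplete part of the parameter plane.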



6 Concluding Remarks 

This paper has presented a method derived from a visualization technique common 
in the study of chaos, which can be used to visualize the effects of variations in 
parameters on a system’s response. The performance map is generated via digital 
computation, and with minimal need for rigorous mathematical analysis. Rough sets 
present a formal and intuitive paradigm whereby a decision table can be reduced to a 
minimal set of rules using approximate reasoning. 
Abstract. We present an approximation space $(U, R)$ which is an infinite (hyper- 
continuum) solution to the domain equation $U \cong C(R)$, the family of elementary 
subsets of $U$. Thus $U$ is a universe of type-free sets and $R$ is the relation of 
indiscernibility with respect to membership in other type-free sets. $R$ thus associates a 
family $[u]_R$ of elementary subsets with $u \in U$, whence $(U, R)$ induces a 
generalized approximation space $(U, c : U \to U, i : U \to U)$, where $c(u) = \cup[u]_R$ 
and $i(u) = \cap[u]_R$. 



1 Rough Set Theory 

The theory of rough sets deals with the approximation of subsets of a universe $U$ of 
points by two definable or observable subsets called lower and upper approximations 
([9]). It extends the theory of point sets by defining interior and closure operators over 
the power set $2^U$, typically those of the partition topology associated with an 
equivalence relation $R$ on $U$. Approximately given concepts, so-called (topological) rough 
subsets, are identified with equivalence classes of subsets under the relation of sharing 
a common interior and closure ([9], [12]). The study of rough sets thence passes into 
abstract set theory, which studies sets of sets. Cattaneo [4] extended the method of point 
sets to topological rough sets by lowering the type of point sets, representing concepts as 
points in certain ordered structures, characterizable as modal algebras,¹ called abstract 
approximation spaces. This type lowering process, by which subsets form points, is the 
foundation of abstract set theory and the cornerstone of Frege’s attempt to derive 
mathematics from logic ([2], [3]). We present rough set theory as an extension of the theory 
of abstract sets, and show that (on pain of Russell’s paradox) the formation of abstract 
sets in set theoretic comprehension is governed by the method of upper and lower 
approximations. 

Our development marries the fundamental notion of rough set theory, the 
approximation space $(U, R)$, with that of abstract set theory, a type-lowering 
correspondence² from $2^U$ to $U$ under which subsets form points and vice versa. The result is a 
proximal Frege structure: an approximation space placed in type-lowering 
correspondence with its power set. After indicating some of the foundational significance of such 
structures, we present a concrete example, called $\mathcal{CFG}$ ([1]). The elementary subsets of 
$\mathcal{CFG}$ form a generalized approximation space characterizable as a non-Kripkean modal 
algebra. Finally, we propose equivalence classes under $R$ (the basic information 
granules from which all elementary subsets are composed) as concrete examples of 
neighborhood systems of deformable subsets as described in advanced computing [7], [8]. 

¹ As in [13] and [14] where these spaces are called generalized approximation spaces. 

² Terminology established by J. L. Bell in [2]. 

1.1 Approximation Spaces (Pawlak, 1982) 

Let $U \neq \emptyset$ and $R \subseteq U \times U$ be an equivalence relation. Then $(U, R)$ is said to be an 
approximation space. Pawlak [9] defines the upper and lower approximations of a set 
$X \subseteq U$, respectively, as: 

$$Cl(X) = \{x \in U \mid (\exists y \in U)(xRy \wedge y \in X)\} = \bigcup\{[u]_R \mid [u]_R \cap X \neq \emptyset\},$$

$$Int(X) = \{x \in U \mid (\forall y \in U)(xRy \rightarrow y \in X)\} = \bigcup\{[u]_R \mid [u]_R \subseteq X\}.$$

$R$ is interpreted as a relation of indiscernibility in terms of some prior family of concepts 
(subsets of $U$). $R$-closed subsets of $U$ are called “elementary”. $Cl$ is a closure operator 
over subsets of $U$, while $Int$ is a projection operator over the subsets of $U$ which is dual 
to $Cl$ under complementation. Further, $(U, C(R))$ is a quasi-discrete topological space, 
where $C(R)$ is the family of elementary subsets of $U$. In summary: 

i0 $Int(X) = U - Cl(U - X)$ 

i1 $Int(X \cap Y) = Int(X) \cap Int(Y)$ 

i2 $Int(U) = U$ 

i3 $Int(A) \subseteq A$ 

i4 $Int(A) \subseteq Int(Int(A))$ 

i5 $Cl(X) = Int(Cl(X))$. 
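
The two approximation operators and the laws i0 to i5 can be checked exhaustively on a small finite approximation space. The six-point universe and its partition below are arbitrary illustrative choices, and the function names are hypothetical.

```python
from itertools import combinations

# A toy approximation space: U is partitioned into three equivalence classes.
U = frozenset(range(6))
BLOCKS = [frozenset({0, 1}), frozenset({2, 3, 4}), frozenset({5})]


def cl(x):
    """Upper approximation: union of the classes that meet x."""
    return frozenset().union(*(b for b in BLOCKS if b & x))


def interior(x):
    """Lower approximation: union of the classes contained in x."""
    return frozenset().union(*(b for b in BLOCKS if b <= x))


def subsets(s):
    """All subsets of s as frozensets."""
    items = sorted(s)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def check_laws():
    """Verify i0 to i5 for every subset (and pair of subsets) of U."""
    for x in subsets(U):
        assert interior(x) == U - cl(U - x)                       # i0
        assert interior(x) <= x                                   # i3
        assert interior(x) <= interior(interior(x))               # i4
        assert cl(x) == interior(cl(x))                           # i5
        for y in subsets(U):
            assert interior(x & y) == interior(x) & interior(y)   # i1
    assert interior(U) == U                                       # i2
    return True


if __name__ == "__main__":
    print(sorted(cl({0})), sorted(interior({0, 2, 3, 4})), check_laws())
```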

It may be convenient to break i1 into a conjunction of inclusions, e.g., 

i1(a) $Int(X) \cap Int(Y) \subseteq Int(X \cap Y)$ 
i1(b) $Int(X \cap Y) \subseteq Int(X)$. 

i1(b) yields the familiar modal Rule of Monotonicity: 

RM: $X \subseteq Y \subseteq U \Rightarrow Int(X) \subseteq Int(Y)$. 

Principle i1(a) is called “modal aggregation” by modal logicians and, when expressed 
in terms of $Cl$, additivity by algebraists. 

The quasi-discrete topological spaces are precisely the partition topologies, i.e., 
spaces $(U, C(R))$ with equivalence relations $R \subseteq U \times U$. 



Rough Equality $R$ induces a corresponding higher-order indiscernibility relation over 
subsets of $U$: $X$ is said to be roughly equal to $Y$ (“$X =_R Y$”) when $Int(X) = Int(Y)$ 
and $Cl(X) = Cl(Y)$. Let $X, Y$ be $R$-closed subsets of $U$. Then $X =_R Y \Rightarrow X = Y$. 
Thus, the $R$-closed subsets of $U$ are precisely defined by their respective approximation 
pairs. Equivalence classes $[X]_{=_R}$ are called rough (i.e., vague, tolerant) subsets of $U$. 
Alternatively, a rough subset of $U$ can be identified with the pair of lower and upper 
approximations $(Int(X), Cl(X))$ that bound $[X]_{=_R}$. 
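
Rough equality can be illustrated by grouping all subsets of a small universe by their approximation pair. The universe, partition, and function names below are illustrative assumptions; the sketch also confirms that an elementary subset is the only member of its rough class.

```python
from collections import defaultdict
from itertools import combinations

U = frozenset(range(6))
BLOCKS = [frozenset({0, 1}), frozenset({2, 3, 4}), frozenset({5})]


def cl(x):
    """Upper approximation: union of the classes that meet x."""
    return frozenset().union(*(b for b in BLOCKS if b & x))


def interior(x):
    """Lower approximation: union of the classes contained in x."""
    return frozenset().union(*(b for b in BLOCKS if b <= x))


def rough_classes():
    """Group every subset of U by its pair (Int(X), Cl(X))."""
    classes = defaultdict(list)
    items = sorted(U)
    for k in range(len(items) + 1):
        for c in combinations(items, k):
            x = frozenset(c)
            classes[(interior(x), cl(x))].append(x)
    return classes


if __name__ == "__main__":
    classes = rough_classes()
    exact = [m for (lo, hi), m in classes.items() if lo == hi]
    print(len(classes), "rough subsets,", len(exact), "of them elementary")
```

With classes of sizes 2, 3 and 1 there are 3 x 3 x 2 = 18 rough subsets of the 64 ordinary subsets, and the 8 elementary subsets each form a class of their own.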
2 Type-Lowering Correspondences and Abstract Rough Sets 

2.1 The Set as One and the Set as Many 

Frege’s doctrine of concepts and their extensions is an attempt to state precisely 
Cantor’s assertion that every “consistent multitude” forms a set. For every concept $X \subseteq U$, 
Frege postulated an object $\ulcorner X \urcorner \in U$, “the extension of $X$,” which is an individual of $U$ 
intended to represent $X$. Frege thus assumed the existence of a type-lowering 
correspondence holding between $U$ and its power set $2^U$, i.e., a function $\ulcorner \cdot \urcorner : 2^U \to U$ mapping³ 
higher type entities (subsets) into lower type entities (objects). Frege attempted to 
govern his introduction of extensions by adopting the principle that the extensions $\ulcorner X \urcorner$ and 
$\ulcorner Y \urcorner$ are strictly identical just in case precisely the same objects (elements of $U$) fall under 
the concepts $X$ and $Y$. This adoption of Basic Law V, as Frege called it, was necessary 
if extensions were to represent concepts in the sense of characterizing precisely which 
objects fall under them. That this “Basic Law” leads to Russell’s paradox regarding 
“the class of all classes which do not belong to themselves,” and is indeed equivalent 
in second order logic to the inconsistent principle of naive comprehension, is now well 
appreciated. 

Proximal Frege Structures Let $(U, R)$ be an approximation space and $\ulcorner \cdot \urcorner : 2^U \to U$, 
$\llcorner \cdot \lrcorner : U \to 2^U$ be functions. If, in addition, (1) $\ulcorner \cdot \urcorner$ is a retraction, i.e., $\ulcorner \llcorner u \lrcorner \urcorner = u$ 
(i.e., $\ulcorner \cdot \urcorner \circ \llcorner \cdot \lrcorner = 1_U$), and (2) the elementary subsets of $U$ are precisely the $X \subseteq U$ 
for which $\llcorner \ulcorner X \urcorner \lrcorner = X$, 
then $(U, R, \ulcorner \cdot \urcorner, \llcorner \cdot \lrcorner)$ is called a proximal Frege structure. This may be summarized by the 
equation: 

$$C(R) \approx U \lhd 2^U,$$

where $C(R)$ is the family of $R$-closed subsets of $U$ and “$U \lhd 2^U$” indicates that $\ulcorner \cdot \urcorner$ 
projects the power set of $U$ onto $U$. 

Let $\mathfrak{F} = (U, R, \ulcorner \cdot \urcorner, \llcorner \cdot \lrcorner)$ be a proximal Frege structure. Writing “$u_1 \in u_2$” for 
“$u_1 \in \llcorner u_2 \lrcorner$”, we thus interpret $U$ as a universe of type-free sets; $\llcorner \cdot \lrcorner$ supports the relation of set 
membership holding between type-free sets (elements of $U$); $\ulcorner \cdot \urcorner$ is Frege’s “extension 
function,” mapping concepts to the elements of $U$ that serve as their extensions. $\mathfrak{F}$ thus 
validates the Principle of Naive Comprehension 

$$u \in \ulcorner X \urcorner \Leftrightarrow X(u), \qquad \text{(PNC)}$$

for elementary ($R$-closed) subsets $X$ of $U$. When $X$ fails to be elementary, the 
equivalence also fails and is replaced by a pair of approximating conditionals: 

(1) $u \in \ulcorner Int(X) \urcorner \Rightarrow X(u)$; 

(2) $X(u) \Rightarrow u \in \ulcorner Cl(X) \urcorner$. 

Note we use applicative grammar “$X(u)$” to indicate that $u$ is an element of $X \subseteq U$, 
reserving “$\in$” for type-free membership. We will write “$\{x : X(x)\}$” to denote $\ulcorner X \urcorner$ 
($X \subseteq U$). 

³ See [3], where Bell observes that this function is a retraction from $2^U$ onto $U$ whose section is 
comprised of precisely Cantor’s “consistent multitudes”. We characterize the latter as precisely 
the elementary subsets of $U$. 




Theorem 1. Let $x, y \in U$. Then, both 

(a) $(\forall u)(u \in x \Leftrightarrow u \in y) \Leftrightarrow x = y$; 

(b) $(\forall u)(x \in u \Leftrightarrow y \in u) \Leftrightarrow R(x, y)$. 

Theorem 2. (Cocchiarella 1972) There are $x, y \in U$ such that $R(x, y)$ but $x \neq y$. 
Assuming the failure of the theorem, Cocchiarella [5] derived Russell’s contradiction. 

The Boolean Algebra of Type-free Sets Since elements of $U$ represent elementary 
subsets of $U$, $U$ forms a complete Boolean algebra under the following definitions of $\cap$, $\cup$ 
and $-$. Let $X \subseteq U$, $u \in U$; then 

$\cup X =_{df} \{y : (\exists x)(X(x) \wedge y \in x)\}$; 

$\cap X =_{df} \{y : (\forall x)(X(x) \rightarrow y \in x)\}$; 

$-u =_{df} \{y : y \notin u\}$. 

Then $(C(R), \cup, \cap, -, \emptyset, U)$ is isomorphic to $(U, \cup, \cap, -, \ulcorner \emptyset \urcorner, \ulcorner U \urcorner)$ under $\ulcorner \cdot \urcorner \upharpoonright C(R)$, the 
restriction of the type-lowering retraction to elementary subsets of $U$. We define $u_1 \subseteq u_2$ 
to be $\llcorner u_1 \lrcorner \subseteq \llcorner u_2 \lrcorner$, i.e., inclusion of type-free sets is the partial ordering naturally 
associated with the Boolean algebra of type-free sets. 

The Discernibility of Disjoint Sets $\mathfrak{F}$ is said to have the Dual Quine property iff: 
$x \cap y = \emptyset \Rightarrow \neg R(x, y)$ $(x, y \in U,\ x \neq \emptyset)$. 

Theorem 3. If $\mathfrak{F}$ satisfies Dual Quine, then $\mathfrak{F}$ provides a model of Peano arithmetic. 

(Translations of) Peano’s axioms are derivable in first order logic from the PNC for 
elementary concepts and Dual Quine. It is for this reason that proximal Frege structures 
provide a generalized model of granular computation. 

3 Cantor-Frege-Gilmore Set Theory 

3.1 The Algebra of $\mathcal{G}$-sets 

Theorem 4. There is a proximal Frege structure $(M_{max}, \equiv, \ulcorner \cdot \urcorner, \llcorner \cdot \lrcorner)$, called $\mathcal{CFG}$, 
satisfying Dual Quine. $M_{max}$ has the cardinality of the hypercontinuum. 

See the Appendix for a sketch of the domain theoretic construction of $\mathcal{CFG}$ given in [1]. 
Elements of $M_{max}$ are called “$\mathcal{G}$-sets”. The complete Boolean algebra 

$$(M_{max}, \cup_{max}, \cap_{max}, -_{max}, \ulcorner M_{max} \urcorner, \ulcorner \emptyset \urcorner)$$

is called the algebra of $\mathcal{G}$-sets, where the subscripted “max” (usually omitted) indicates 
that these operations are the ones naturally associated with type-free sets. We write $U$ for 
$\ulcorner M_{max} \urcorner$ and $\emptyset$ for $\ulcorner \emptyset \urcorner$. Recall that $C(\equiv) \approx M_{max} \lhd 2^{M_{max}}$. 

Let $a \in M_{max}$; define the outer penumbra of $a$, symbolically, $\Diamond a$, to be the $\mathcal{G}$-set 
$\cup[a]_\equiv$; dually, define the inner penumbra, $\Box a$, to be the $\mathcal{G}$-set $\cap[a]_\equiv$. These operations, 
called the penumbral modalities, are interpreted using David Lewis’ counterpart 
semantics for modal logic [6]. Your Lewisian counterpart (in a given world) is a person more 
qualitatively similar to you than any other object (in that world). You are necessarily 
(possibly) P iff all (some) of your counterparts are P. Similarly, if $a$ and $b$ are 
indiscernible $\mathcal{G}$-sets, $a$ is more similar to $b$ than any other (i.e., discernible!) $\mathcal{G}$-set. Thus we 
call $b$ a counterpart of $a$ whenever $a \equiv b$. Then $\Box a$ ($\Diamond a$) represents the set of $\mathcal{G}$-sets 
that belong to all (some) of $a$’s counterparts. In this sense we can say that a $\mathcal{G}$-set $x$ 
necessarily (possibly) belongs to $a$ just in case $x$ belongs to $\Box a$ ($\Diamond a$). 

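Plenitude 

Theorem 5. (Plenitude) Let $a, b \in M_{max}$; then $\Box a \equiv a \equiv \Diamond a$. Further, suppose $a \subseteq b$ 
and $a \equiv b$. Then, for all $c \in M_{max}$, $a \subseteq c \subseteq b \Rightarrow c \equiv b$. 

Corollary 1. $([a]_\equiv, \cap_{max}, \cup_{max}, \Box a, \Diamond a)$ is a complete lattice with least (greatest) 
element $\Box a$ ($\Diamond a$). 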

3.2 The Penumbral Algebra 

We call $(M_{max}, \cup, \cap, -, U, \emptyset, \Box, \Diamond)$ the penumbral algebra. It is a modal algebra 
satisfying i0, i2, i3, i4 and conjectured to be additive, i.e., to satisfy i1(a); in addition, 

$$\Diamond m \subseteq \Box \Diamond m,$$

for all $m \in M_{max}$. R. E. Jennings has shown⁴ that, on pain of Russell’s contradiction, 
Rule Monotonicity fails: 

Theorem 6. There are $a, b \in M_{max}$ such that $a \subseteq b$ but $\Box a \not\subseteq \Box b$, 
whence the converse of [C] does not hold in the penumbral algebra. 

Closed $\mathcal{G}$-sets, Open $\mathcal{G}$-sets Let $a$ be a $\mathcal{G}$-set. Then, we say $a$ is (penumbrally) closed if 
$\Diamond a = a$; $C(M_{max})$ is the set of closed $\mathcal{G}$-sets. Dually, $a$ is (penumbrally) open if $\Box a = a$; 
$O(M_{max})$ is the set of open $\mathcal{G}$-sets. By Dual Quine and Plenitude, the universe $U$ and 
the empty set $\emptyset$ are clopen (penumbrally both open and closed). We conjecture (based 
upon an analogy with the Rice-Shapiro and Myhill-Shepardson theorems in Recursion 
Theory) that they are the only clopen $\mathcal{G}$-sets. 

A Generalized Approximation Space Since elements of $M_{max}$ represent elementary 
subsets of $M_{max}$, $\Box : M_{max} \to O(M_{max})$ and $\Diamond : M_{max} \to C(M_{max})$ can be 
interpreted as approximation operations over $C(\equiv)$. Hence, the penumbral algebra is a 
generalized approximation space in which elementary subsets are approximated from 
above by penumbrally closed elementary subsets and from below by penumbrally open 
elementary subsets: Let $i, c : C(\equiv) \to C(\equiv)$ be maps given by 

$i : X \mapsto \llcorner \Box \ulcorner X \urcorner \lrcorner$ 

$c : X \mapsto \llcorner \Diamond \ulcorner X \urcorner \lrcorner$ $\quad (X \in C(\equiv))$. 

⁴ In private correspondence. 
Then $(C(\equiv), i, c)$ is an example of a generalized approximation space in almost the sense 
of Y. Y. Yao [14], since it satisfies i0 and i2 - i5, but fails i1(b). In addition, we 
conjecture that $(C(\equiv), i, c)$ is additive, i.e., satisfies i1(a). It is called a penumbral 
approximation space. These approximation maps may be extended to the full power set 
$2^{M_{max}}$, yielding a non-additive (see [1]), non-monotonic, generalized approximation 
space $(2^{M_{max}}, i, c)$ satisfying i0 and i2 - i4. 

Symmetries of the Algebra of Counterparts Let $a \in M_{max}$. The penumbral border 
of $a$, $B(a)$, is the $\mathcal{G}$-set $\Diamond a -_{max} \Box a$. Note that $B(\emptyset) = B(U) = \emptyset$. We define the 
maximum deformation of $a$, $a^*$, to be the $\mathcal{G}$-set: 

$$(a \cup \{x \in B(a) : x \notin a\}) - \{x \in B(a) : x \in a\}.$$

By Plenitude, $a^* \equiv a$. 

Theorem 7. $([a]_\equiv, \cap_{max}, \cup_{max}, {}^*, \Box a, \Diamond a)$ is a complete Boolean algebra with $^*$ as 
complementation. It is called the algebra of counterparts of $a$. 

More generally, let $X \subseteq B(a)$. We define the $X$-transform of $a$ to be the $\mathcal{G}$-set: 
$(a \cup \{x \in X : x \notin a\}) - \{x \in X : x \in a\}$. Again, by Plenitude, the $X$-transform of $a$ belongs 
to $[a]_\equiv$. Define $f_X : [a]_\equiv \to [a]_\equiv$ by: $f_X(b)$ = the $X$-transform of $b$ ($b \in [a]_\equiv$). $f_X$ 
is called a deformation of $[a]_\equiv$. Let $\mathcal{G}_a$ be the family of all $f_X$ for $X \subseteq B(a)$, i.e., the 
family of all deformations of $a$. Note that $f_X(f_X(b)) = b$ ($b \in [a]_\equiv$). 

Lemma 1. $\mathcal{G}_a$ is a group under the composition of deformations, with $f_\emptyset$ the identity 
deformation and each deformation its own group inverse. 
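
The algebra behind Lemma 1 can be illustrated on ordinary finite sets, where the $X$-transform coincides with symmetric difference. The sketch below is only a finite analogy and does not model the type-free universe or the indiscernibility relation; the sets used and the names `transform` and `check_group` are hypothetical.

```python
from itertools import combinations


def transform(b, x):
    """X-transform of b: add the points of x missing from b, drop the rest of x."""
    b, x = frozenset(b), frozenset(x)
    return (b | (x - b)) - (x & b)


def subsets(s):
    """All subsets of s as frozensets."""
    items = sorted(s)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def check_group(a, border):
    """Check identity, self-inverse and closure for all X, Y within the border."""
    a = frozenset(a)
    assert transform(a, frozenset()) == a                  # identity f_{empty}
    for x in subsets(border):
        assert transform(a, x) == a ^ x                    # symmetric difference
        assert transform(transform(a, x), x) == a          # each f_X has order 2
        for y in subsets(border):
            # Composing two deformations gives the deformation indexed by X ^ Y.
            assert transform(transform(a, x), y) == transform(a, x ^ y)
    return True


if __name__ == "__main__":
    print(sorted(transform({0, 1}, {1, 2, 3})), check_group({0, 1}, {1, 2, 3}))
```

Since composition corresponds to symmetric difference of the index sets, the group is commutative and every non-identity element has order 2.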

Since (trivially) each deformation $f_X$ in $\mathcal{G}_a$ maps congruent (indiscernible) $\mathcal{G}$-sets to 
congruent $\mathcal{G}$-sets, $\mathcal{G}_a$ may be regarded as a symmetry group, every element of which has 
order 2. Since $f_X$ preserves maximal continuous approximations (see Appendix; in 
detail, [1]), $\mathcal{G}_a$ is a group of continuous transformations of $\mathcal{G}$-sets. 

Neighborhood Systems Neighborhood systems express the semantics of “nearby” or 
proximity. T. Y. Lin [7] has proposed neighborhood systems as a generalization of rough 
set theory which is useful in characterizing the notion of a “tolerant subset” in soft 
computing. Let $U$ be a universe of discourse and $u \in U$; a neighborhood of $u$ is simply a 
non-empty subset $X$ of $U$. A neighborhood system of $u$ is a family of neighborhoods of 
$u$, i.e., simply a family of subsets of $U$. A neighborhood system of $U$ is the assignment 
of a neighborhood system of $u$ to each $u \in U$. 

Admissible Families of Elastically Deformable Subsets The notion of a neighborhood of 
a point can be “lifted” to the notion of the neighborhood of a subset. For example, two 
subsets of the real numbers can be said to be “near” if they differ by a set of measure 
0. Lin has proposed interpreting the neighborhood system of a subset as an admissible 
family of “elastically deformable” characteristic functions: 



A real world fuzzy set should allow a small amount of perturbation, so it should 
have an elastic membership function. [8], p. 1. An elastic membership function 
can tolerate a small amount of continuous stretching with a limited number of 
broken points. ... We need a family of membership functions to express the 
stretching of a membership function. [7], p. 13. Mathematically, such an 
elastic membership function can be expressed by a highly structured subset of the 
membership function space. [8], p. 1. 

In [7], topological and measure theoretic means, presupposing the system of real 
numbers, are deployed to make this notion of a “small” amount of perturbation 
precise. Since indiscernible $\mathcal{G}$-sets transform continuously into one another by swapping 
boundary elements, they are almost co-extensive, disagreeing only upon “exceptional 
points” or singularities (see [1]). The transformation is elastic in that the number of 
broken points is negligibly small. We therefore offer the algebra of counterparts of a given 
$\mathcal{G}$-set as a concrete example of an admissible family of “elastically deformable” 
characteristic functions, i.e., as a tolerant subset in the sense of Lin. 



4 Appendix: A Sketch of the Construction of C!FQ 

First, using the theory of SFP objects from Domain Theory [11], we construct a continuum
reflexive structure $D_\infty$ satisfying the recursive equation $D_\infty \cong [D_\infty \to T]$, where
$\cong$ is order isomorphism, $T$ is the complete partial order (cpo) of truth values

    true   false
       \   /
         ⊥

(here $\bot$ represents the zero information state, i.e., a classical truth value gap) and
$[D_\infty \to T]$ is the space of all continuous (limit preserving) functions from $D_\infty$ to
$T$, under the information cpo naturally associated with nested partial functions: $f \le g$
iff $f(d) \le_T g(d)$ $(d \in D_\infty)$.

The Kleene Strong three valued truth functional connectives are monotone (order
preserving) with respect to $T$, and are trivially continuous. Unfortunately, $[D_\infty \to T]$ is
not closed under the infinitary logical operations such as arbitrary intersections, whence
$D_\infty$ fails to provide a model of partial set theory combining unrestricted abstraction with
universal quantification. This is the rock upon which Church's quantified $\lambda$-calculus
foundered (essentially, Russell's paradox).

However, the characteristic property of SFP objects ensures that each monotone function
$f : D_\infty \to T$ is maximally approximated by a unique continuous function $c_f$ in
$[D_\infty \to T]$, whence $c_f$ is in $D_\infty$ under representation.

Next, the space $M$ of monotone functions from $D_\infty$ to $T$ is a solution for the reflexive
equation $M \cong (M \to T)$, where $(M \to T)$ is the space of all “hyper-continuous”
functions from $M$ to $T$. A monotone function $f : M \to T$ is said to be hyper-continuous
just in case $c_x = c_y \Rightarrow f(x) = f(y)$ $(x, y \in M)$. The elements of $(M \to T)$
correspond to the subsets of $M$ which are closed under the relation of sharing a common
maximal continuous approximation. Indeed, $\equiv$ is the restriction of this equivalence relation
to maximal elements of $M$.

We observe that $M$ is closed under arbitrary intersections and unions; more generally
it provides a first order model for partial set theory combining unrestricted abstraction
with full quantification.

Finally, since $(M, \le_{info})$ is a complete partial order, $M$ is closed under least upper
bounds of $\le_{info}$-chains. Hence, by Zorn's Lemma, there are $\le_{info}$-maximal elements
of $M$. Let $M_{max}$ be the set of maximal elements of $M$. Since

$(\forall x, y \in M_{max})[x \in y \lor x \notin y]$,

$M_{max}$ is a classical subuniverse of $M$.

Theorem 8. Let $a, b \in M_{max}$. Then $a \equiv b$ iff $c_a = c_b$.
Abstract. Rough Set Exploration System, a set of software tools featuring
a library of methods and a graphical user interface, is presented.
Methods, features and abilities of the implemented software are discussed
and illustrated with a case study in data analysis.



1 Introduction 

Research in decision support systems, and in classification algorithms in particular
those concerned with the application of rough sets, requires experimental verification.
At a certain point it is no longer possible to perform every single experiment
using software designed for a single purpose. To be able to make thorough,
multi-directional practical investigations one has to possess an inventory of software
tools that automate basic operations, so that it is possible to focus on the most
essential matters. That was the idea behind the creation of the Rough Set Exploration
System, further referred to as RSES for short.

The first version of RSES and the library RSESlib were released several years ago.
RSESlib is also used in the computational kernel of ROSETTA - an advanced
system for data analysis (see [16]) constructed at NTNU (Norway), which
contributed a lot to RSES development and gained wide recognition. Comparison
with other classification systems (see [11]) proves its value.

The RSES software and its computational kernel, the new RSESlib 2.0 library,
maintain all advantages of the previous version. The algorithms from the first
incarnation of the library have been re-mastered to provide better flexibility, extended
functionality and the ability to process massive data sets. New algorithms added
to the library reflect the current state of our research in classification methods
originating in rough set theory.

The library of functions is not sufficient as an answer to experimenters' demand
for a helpful tool. Therefore the RSES user interface was constructed. This
interface allows RSESlib to be used interactively.

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 106-113, 2001. 

© Springer-Verlag Berlin Heidelberg 2001 




RSES and RSESlib - A Collection of Tools for Rough Set Computations 107 



2 Basic notions 

In order to provide a clear description further in the paper we bring here some
essential definitions.

An information system ([12]) is a pair of the form $\mathbf{A} = (U, A)$ where $U$ is a
universe of objects and $A = (a_1, \ldots, a_m)$ is a set of attributes, i.e. mappings of the
form $a_i : U \to V_{a_i}$, where $V_{a_i}$ is called the value set of the attribute $a_i$. The decision
table is also a pair of the form $\mathbf{A} = (U, A \cup \{d\})$ where the major feature that is
different from the information system is the distinguished attribute $d$. We will
further assume that the set of decision values is finite. The $i$-th decision class
is a set of objects $C_i = \{o \in U : d(o) = d_i\}$, where $d_i$ is the $i$-th decision value
taken from the decision value set $V_d = \{d_1, \ldots, d_{|V_d|}\}$.

For any subset of attributes $B \subseteq A$ the indiscernibility relation $IND(B)$ is
defined as follows:

$x \, IND(B) \, y \iff \forall_{a \in B} \; a(x) = a(y)$    (1)

where $x, y \in U$.

Having the indiscernibility relation we may define the notion of reduct. $B \subseteq A$ is
a reduct of the information system if $IND(B) = IND(A)$ and no proper subset of
$B$ has this property. In the case of decision tables the decision reduct is a set $B \subseteq A$
of attributes such that it cannot be further reduced and $IND(B) \subseteq IND(d)$.
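
As a minimal illustration of definition (1) and of the reduct notion, the following Python sketch computes the partition induced by IND(B) and enumerates reducts by brute force. The table layout (a list of dicts) and the function names are assumptions made for this example; they are not part of RSESlib, and the exhaustive search is only practical for very small attribute sets.

```python
from itertools import combinations


def ind_partition(table, attrs):
    """Partition object indices into the equivalence classes of IND(attrs)."""
    classes = {}
    for idx, row in enumerate(table):
        key = tuple(row[a] for a in attrs)
        classes.setdefault(key, []).append(idx)
    return sorted(classes.values())


def reducts(table, attrs):
    """Brute-force search for all minimal B with IND(B) = IND(attrs)."""
    full = ind_partition(table, attrs)
    found = []
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # contains a smaller reduct, so it is not minimal
            if ind_partition(table, subset) == full:
                found.append(subset)
    return found


# Hypothetical information system with four objects and three attributes.
table = [
    {"a1": 0, "a2": 0, "a3": 1},
    {"a1": 0, "a2": 1, "a3": 1},
    {"a1": 1, "a2": 0, "a3": 0},
    {"a1": 1, "a2": 1, "a3": 0},
]
```

In this toy table a1 and a3 carry the same information, so either of them together with a2 discerns all objects.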

A decision rule is a formula of the form

$(a_{i_1} = v_1) \wedge \ldots \wedge (a_{i_k} = v_k) \Rightarrow d = v_d$    (2)

where $1 \le i_1 < \ldots < i_k \le m$, $v_i \in V_{a_i}$. Atomic subformulae $(a_{i_1} = v_1)$ are called
conditions. We say that rule $r$ is applicable to an object, or alternatively, the object
matches the rule, if its attribute values satisfy the premise of the rule. Support,
denoted as $Supp_{\mathbf{A}}(r)$, is equal to the number of objects from $\mathbf{A}$ for which rule
$r$ applies correctly. $Match_{\mathbf{A}}(r)$ is the number of objects in $\mathbf{A}$ for which rule $r$
applies in general. Analogously the notion of a matching set for a rule or collection
of rules may be introduced (see [2], [4]).
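
The two rule statistics can be sketched in a few lines of Python. This is an illustration under stated assumptions: rules are stored as a dict with hypothetical keys `conditions` and `decision`, and "applies correctly" is read here as "the premise holds and the object's decision equals the rule's decision".

```python
def matches(rule, obj):
    """True if the object's attribute values satisfy the premise of the rule."""
    return all(obj[a] == v for a, v in rule["conditions"].items())


def match_count(rule, table):
    """Match(r): number of objects to which the rule applies at all."""
    return sum(matches(rule, obj) for obj in table)


def support(rule, table, d="d"):
    """Supp(r): number of matching objects whose decision agrees with the rule."""
    return sum(matches(rule, obj) and obj[d] == rule["decision"] for obj in table)


# Hypothetical rule (a1 = 1) and (a2 = 0) => d = yes, and a small decision table.
rule = {"conditions": {"a1": 1, "a2": 0}, "decision": "yes"}
table = [
    {"a1": 1, "a2": 0, "d": "yes"},
    {"a1": 1, "a2": 0, "d": "no"},
    {"a1": 0, "a2": 0, "d": "yes"},
    {"a1": 1, "a2": 0, "d": "yes"},
]
```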

By a cut for an attribute $a_i \in A$, such that $V_{a_i}$ is an ordered set, we will denote
a value $c \in V_{a_i}$. With the use of a cut we may replace the original attribute with a
new, binary attribute which tells us whether the actual attribute value for an object
is greater or lower than $c$ (more in [8]).
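
A cut turns an ordered attribute into a binary one. The sketch below is a hypothetical helper, not an RSESlib function; how a value exactly equal to the cut is treated is an assumption of this example (it is mapped to 0).

```python
def apply_cut(values, c):
    """Binarise an ordered attribute: 1 if the value is greater than cut c, else 0."""
    return [1 if v > c else 0 for v in values]


# Hypothetical attribute column and cut.
column = [1.0, 2.5, 4.0, 2.0]
binary = apply_cut(column, 2.0)
```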

A template of $\mathbf{A}$ is a propositional formula $\bigwedge (a_i = v_i)$ where $a_i \in A$ and
$v_i \in V_{a_i}$. A generalised template is a formula of the form $\bigwedge (a_i \in T_i)$ where $T_i \subset V_{a_i}$.
An object satisfies (matches) a template if for every attribute $a_i$ occurring in the
template the value of this attribute on the considered object is equal to $v_i$ (belongs to
$T_i$ in the case of a generalised template). The template induces in a natural way the split
of the original information system into two distinct subtables containing objects that
do or do not satisfy the template, respectively. A decomposition tree is a binary
tree, whose every internal node is labeled by some template and whose every external node
(leaf) is associated with the set of objects matching all templates on the path from
the root to the given leaf (see [7]).
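
Template matching and the induced split can be sketched as follows. The representation is an assumption of this example: a generalised template is a dict mapping each attribute to its set of admissible values, and an ordinary template is the special case of one-element sets.

```python
def satisfies(obj, template):
    """True if every attribute in the template takes an admissible value on obj."""
    return all(obj[a] in allowed for a, allowed in template.items())


def split(table, template):
    """Split a table into objects that do and do not satisfy the template."""
    yes = [obj for obj in table if satisfies(obj, template)]
    no = [obj for obj in table if not satisfies(obj, template)]
    return yes, no


# Hypothetical table and generalised template (a1 in {0, 1}) and (a2 in {"x"}).
table = [
    {"a1": 0, "a2": "x"},
    {"a1": 1, "a2": "y"},
    {"a1": 2, "a2": "x"},
    {"a1": 1, "a2": "x"},
]
template = {"a1": {0, 1}, "a2": {"x"}}
yes, no = split(table, template)
```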
3 The RSES library v. 2.0

The RSES library (RSESlib) is constructed according to the principles of object
oriented programming. The programming language used in the implementation is
Microsoft Visual C++ compliant with ANSI/ISO C++ (the ISO/IEC 14882 standard).

The algorithms that have been implemented in the RSES library fall into 
two general categories. 

The first category gathers the algorithms aimed at management and editing of
data structures that are present in the library.

The algorithms for performing rough set theory based operations on data
constitute the second, most essential kind of tools implemented inside the RSES
library. To give an idea of what apparatus is given to the user we shortly describe
the most important algorithms.

Reduction algorithms, i.e. algorithms allowing calculation of the collections of
reducts for a given information system (decision table). The exhaustive algorithm
for calculation of all reducts is present; however, such an operation may be
time-consuming due to the computational complexity of the task (see [13]). Therefore
approximate and heuristic solutions such as genetic or Johnson algorithms were
implemented (see [14], [6] for details). The library methods for calculation of
reducts allow setting initial conditions for the number of reducts to be calculated,
required accuracy, coverage and so on. Based on a calculated reduct it is possible
to calculate decision rules (see [4]). Procedures for rule calculation allow the user to
determine some crucial constraints for the set of decision rules. The rules received are
accompanied by several coefficients that are further used while the rules are
being applied to the set of objects (see [3], [2]). In connection with the algorithms for
reduct/rule calculation there is a subclass of algorithms allowing shortening of
rules and reducts with respect to different requirements (see [3]).

Discretisation algorithms allow cuts for attributes to be found. In this way the initial
decision table is converted to one described with less complex, binary attributes
without loss of information about discernibility of objects (see [10], [8], [2]).

Template generation algorithms provide means for calculation of templates 
and generalised templates. Placed side by side with template generation are the 
procedures for inducing table decomposition trees (see [7] and [9]). 

Classification algorithms are used for establishing a decision value with the use of
decision rules and/or templates. Operations for voting among rules with the use of
different schemes fall into this category (see [3], [9], [4], [2]).

During operation certain functions belonging to RSESlib may read information from
and write information to files. Most of the files that can be read or written are regular
ASCII text files. Their particular sub-types can be distinguished by reviewing
the contents or identifying file extensions.

4 The RSES GUI. 

To simplify the use of the RSES library and make it more intuitive a graphical user
interface was constructed. This interface allows interaction with library methods
in two modes. The first is directed towards ease of use and visual representation of
workflow. The other is intended to provide tools for construction of scripts that
use RSES functionality.

4.1 The project interface. 

The project interface window (see Figure 1) consists of two parts. The upper part is
the project workspace where icons representing objects occurring during our
computation are presented. The lower part is dedicated to messages, status reports,
errors and warnings produced during operations. It was the designers' intention to
simplify the operations on data within a project. Therefore, the entities appearing
in the process of rough set based computation are represented in the form of
icons placed in the upper part of the workplace. Such an icon is created every time
data (table, reducts, rules, ...) is loaded from a file. The user can also place
an empty object in the workplace and further fill it with the results of an operation
performed on other objects. The objects that may exist in the workplace are:
decision table, collection of reducts, set of rules, decomposition tree, set of cuts
and collection of results. Every object (icon) appearing in the project has a
set of actions connected with it. By right-clicking on the object the user invokes
a context menu for that object. It is also possible to call the necessary action
from the general pull-down program menu in the main window. Menu choices allow
the user to view and edit objects as well as make them input to some computation. In
many cases the choice of some option from the context menu will cause a new dialog
box to open. In this dialog box the user can set values of coefficients used in the desired
calculation, in particular, designate the variable which will store the results of
the invoked operation. If the operation performed on the object leads to the creation of
a new object or modification of an existing one, then such a new object is connected
by an edge originating in the object (or objects) which contributed to its current
state. The setting of arrows connecting icons in the workspace changes dynamically
as new operations are performed.

Fig. 1. The project interface window.

The entire project can be saved to a file on disk to preserve results and
information about the current setting of coefficients. That also allows the
entire work to be re-created on another computer or with other data.

4.2 Scripting interface. 

In case we have to perform many experiments with different parameter settings
and changing data it is more convenient to plan such a set of operations in
advance and then let the computer calculate. The idea of simplifying the preparation
and execution of compound experiments drove the creation of the RSES scripting
mechanism.

The mechanism for performing script-based operations with the use of RSES
library components consists of three major parts. The scripting interface is the
part visible to the user during script preparation. The other two are behind the scenes
and perform simple syntax checking and script execution. We will not describe
checking and executing in greater detail.

The user interface for writing RSES based scripts is quite simple. The main
window is split into two parts, of which the upper contains the workplace where scripts
are being edited and the lower contains messages generated during script execution
(Figure 2).

The RSES scripting language constructs available to user are: 

— Variables. Any variable is inserted to script with predefined type. The type 
may be either standard (e. g. integer, real) or RSES-specific. 

— Functions. The user can use all the functions from the RSES armory as well as
standard arithmetic operations. A function in a script always returns a value.

— Procedures. The procedures from RSES library, unlike functions, may not 
return a value. The procedures correspond to major operations such as: 
reduct calculation, rule shortening, loading and saving objects. 

— Conditional expressions. The expressions of the form If ... then ... else may 
be used. The user defines condition with use of standard operations. 

— Loops. Simple loop can be used within RSES script. The user is required to 
designate loop control variable and set looping parameters. 

While preparing a script the user may not freely edit it. He/she can only insert
or remove one of the constructs mentioned above using the context menu which
appears after right-clicking on the line in the script where insertion/removal
is due to occur. The new construct may also be inserted with the use of the toolbar.
The operation of inserting the new construct involves choosing the required
operation name and setting all required values. All this is done by easy
point-and-click operations within an appropriate dialog box supported by pull-down lists
of available names. The editing operations are monitored by a checker to avoid
obvious errors.

Fig. 2. The script interface window.

Once script editing is finished the script can be executed with the menu command
Run. Before execution the syntax is checked once again. The behaviour of the
currently running script may be seen in the lower part of the interface window.



5 Case study - decomposition by template tree 

As already mentioned, the ability to deal with large data sets is one of the key new
features of RSESlib 2.0. To deal with such massive data we use decomposition
based on template-induced trees.

Decomposition is the problem of partitioning a large data table into smaller
ones. One can use templates extracted from data to partition a data table into
blocks of objects with common features. We consider here decomposition schemes
based on a template tree (see [7]). The main goal of this method is to construct a
decomposition tree. Let A be a decision table. The algorithm for decomposition
tree construction can be presented as follows:
Algorithm 1 Decomposition by template tree (see [7])

Step 1 Find the best template T in A.

Step 2 Divide A into two subtables: A_1 containing all objects satisfying T
       and A_2 = A - A_1.

Step 3 If the obtained subtables are of acceptable size (in the sense of rough set methods)
       then stop
       else repeat 1-3 for all "too large" subtables.

This algorithm produces a binary tree of subtables with corresponding sets of 
decision rules for subtables in the leaves of the tree. 
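
The recursion of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say how the "best" template is found, so `best_template` below is a stand-in heuristic (the single descriptor that splits the table most evenly), `max_size` is a hypothetical threshold for "acceptable size", and the generation of decision rules in the leaves is omitted.

```python
from collections import Counter


def best_template(table, attrs):
    """Stand-in heuristic (assumption): descriptor (a, v) giving the most even split."""
    best, best_gap = None, None
    for a in attrs:
        for v, n in Counter(obj[a] for obj in table).items():
            if not 0 < n < len(table):
                continue  # this descriptor does not split the table
            gap = abs(2 * n - len(table))
            if best_gap is None or gap < best_gap:
                best, best_gap = (a, v), gap
    return best


def build_tree(table, attrs, max_size):
    """Steps 1-3: split by the best template until subtables are small enough."""
    if len(table) <= max_size:
        return {"leaf": table}
    template = best_template(table, attrs)
    if template is None:  # all objects are identical on attrs; cannot split further
        return {"leaf": table}
    a, v = template
    yes = [obj for obj in table if obj[a] == v]
    no = [obj for obj in table if obj[a] != v]
    return {
        "template": template,
        "yes": build_tree(yes, attrs, max_size),
        "no": build_tree(no, attrs, max_size),
    }


def leaves(tree):
    """Collect the subtables stored in the leaves, left to right."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaves(tree["yes"]) + leaves(tree["no"])


# Hypothetical table of eight objects over two binary attributes.
table = [{"a": i % 2, "b": i // 4} for i in range(8)]
tree = build_tree(table, ["a", "b"], max_size=2)
```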

The decomposition tree produced by the algorithm presented above can be used to
classify a new case to the proper decision class. Suppose we have a binary
decomposition tree. Let u be a new object and A(T) be a subtable containing all objects
matching template T. We classify object u starting from the root of the tree as
follows:

Algorithm 2 Classification by template tree (see [7])

Step 1 If u matches template T found for A
       then: go to subtree related to A(T)
       else: go to subtree related to A(¬T).

Step 2 If u is at the leaf of the tree then go to 3
       else: repeat 1-2 substituting A(T) (or A(¬T)) for A.

Step 3 Classify u using decision rules for the subtable attached to the leaf.
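
The descent described in Algorithm 2 can be sketched as below. The tree layout (nested dicts with hypothetical keys `template`, `yes`, `no`, `leaf`) is an assumption of this example, and the final step uses a simple majority vote over the leaf subtable as a stand-in for the rough set decision rules that the text attaches to each leaf.

```python
from collections import Counter


def classify(tree, u, d="d"):
    """Walk from the root to a leaf following template matches, then decide."""
    node = tree
    while "leaf" not in node:
        a, v = node["template"]
        node = node["yes"] if u[a] == v else node["no"]
    # Stand-in for leaf decision rules (assumption): majority decision in the leaf.
    return Counter(obj[d] for obj in node["leaf"]).most_common(1)[0][0]


# Hypothetical one-level decomposition tree with template (a = 0) at the root.
tree = {
    "template": ("a", 0),
    "yes": {"leaf": [{"a": 0, "d": "x"}, {"a": 0, "d": "x"}, {"a": 0, "d": "y"}]},
    "no": {"leaf": [{"a": 1, "d": "y"}]},
}
```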

This algorithm uses a binary decision tree; however, it should not be mistaken
for C4.5 or ID3. As stated before, in our experiments (see Section 5.1) rough
set methods have been used for the construction of classifying algorithms in the leaves of
the decomposition tree (see [2] and [3] for more details).

5.1 Experiments with Forest CoverType data 

The Forest CoverType data used in our experiments were obtained from US
Forest Service (USFS) Region 2 Resource Information System (RIS) data (see [5],
[17]). The data is in rectangular form with 581012 rows and 56 columns (55 + decision).
There are 7 decision values. This data has been studied before using Neural
Networks and Discriminant Analysis Methods (see [5]).

The original Forest CoverType data was divided into a training set (11340
objects), a validation set (3780 objects) and a testing set (565892 objects) (see [5]).
In our experiments we used the algorithms presented in Section 5 that are
implemented in RSESlib. The decomposition tree for the data has been created only by
reference to the training set. The validation set was used for adaptation of the
classifying algorithms obtained in the leaves of the decomposition tree (see
[2] and [3] for more). The classification algorithm was applied to new cases from
the test set. As a measure of classification success we use accuracy (see e.g. [7],
[5]). We define the accuracy as the ratio of the number of properly classified
new cases to the total number of new cases. Three different classification
systems applied to the Forest CoverType data have given accuracy of 0.70 (Neural
Network), 0.58 (Discriminant Analysis) and 0.73 (RSES).
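
The accuracy measure defined above, and the consistency of the reported set sizes with the total number of rows, can be checked with a short Python sketch. The function name and the toy label lists are assumptions of this example; only the three set sizes are taken from the text.

```python
def accuracy(predicted, actual):
    """Ratio of properly classified new cases to the total number of new cases."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Set sizes reported in the text; their sum should equal the number of rows.
train, validation, test_size = 11340, 3780, 565892
total = train + validation + test_size
```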
Abstract. The Variable Precision Rough Sets Model (VPRS) is an extension of
the original Rough Set Theory. To employ VPRS analysis the decision maker
(DM) needs to define satisfactory levels of quality of classification and β
(confidence) value. This paper considers VPRS analysis when the DM only
defines a satisfactory level of quality of classification. Two criteria for selecting
a β-reduct under this condition are discussed. They include the use of
permissible β intervals associated with each β-reduct. An example study is
given illustrating these criteria. The study is based on US state level data
concerning motor vehicle traffic fatalities.



1 Introduction 

The Variable Precision Rough Sets Model (VPRS) ([11], [12]) is an extension of the
original Rough Set Theory (RST) ([6], [7]). To employ VPRS analysis the decision
maker (DM) needs to define satisfactory levels of quality of classification and β
(confidence) value. VPRS related research papers (see [1], [5], [11], [12]) do not
focus in detail on the choice of β value. There appears to be a presumption that this value
will be specified by the decision maker (DM). In this paper we consider VPRS
analysis when only information on the satisfactory level of quality of classification is
known. The lessening of a priori assumptions determined by the DM should allow
VPRS to be more accessible for analysis of applications.



2 Preliminaries 

Central to VPRS analysis is the information system, made up of objects each 
classified by a set of decision attributes D and characterized by a set of condition 
attributes C. A value denoting the nature of an attribute to an object is called a 
descriptor. All descriptor values are required to be in categorical form allowing for 
certain equivalence classes of objects to be formed. That is, condition and decision 
classes associated with the C and D sets of attributes respectively. These are used in 
RST to correctly classify objects. 



W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 114-122, 2001.
In contrast to RST, when an object is classified in VPRS there is a level of
confidence (β threshold) in its correct classification. The β value represents a bound
on the conditional probability of a proportion of objects in a condition class being
classified to the same decision class. Following [1] the value β denotes the proportion
of correct classifications, in which case the domain of β is (0.5, 1.0].

For a β value and decision class, those condition classes for which the
proportion of objects classified to the decision class is at least β are classified
to the decision class. Similarly, there are the condition classes definitely not classified,
since the proportion of objects in a condition class classified to the decision class does
not exceed 1 - β. These sets are known respectively as the β-positive and β-negative
regions.^1 The set of condition classes whose proportions of objects classified to the
decision class lie between these values 1 - β and β is referred to as the β-boundary
region. Defining the universe U to refer to all the objects in the information system
characterised by the set C (with $Z \subseteq U$ and $P \subseteq C$), then:

β-positive region of the set Z: $\bigcup_{\Pr(Z|X_i) \ge \beta} \{X_i \in EC(P)\}$,

β-negative region of the set Z: $\bigcup_{\Pr(Z|X_i) \le 1-\beta} \{X_i \in EC(P)\}$,

β-boundary region of the set Z: $\bigcup_{1-\beta < \Pr(Z|X_i) < \beta} \{X_i \in EC(P)\}$,

where EC( ) denotes a set of equivalence classes (in this case, condition classes based
on a subset of attributes P). Having defined and computed measures relating to the
ambiguity of classification, [11] define in VPRS the measure of quality of
classification, $\gamma^\beta(P, D)$ (with $P \subseteq C$), given by

$\gamma^\beta(P, D) = \frac{card\left(\bigcup_{\Pr(Z|X_i) \ge \beta} \{X_i \in EC(P)\}\right)}{card(U)}$, with $Z \in EC(D)$,

for a specified value of β. The value $\gamma^\beta(P, D)$ measures the proportion of objects in
U for which classification is possible at the specified value of β. This measure is used
operationally to define and extract reducts. In RST a reduct is a subset of C offering
the same information content as given by C and is used in the rule construction
process. Formally in VPRS from [12], an approximate (probabilistic) reduct $RED^\beta(C, D)$,
referred to here as a β-reduct, has the twin properties:

1. $\gamma^\beta(C, D) = \gamma^\beta(RED^\beta(C, D), D)$,

2. No proper subset of $RED^\beta(C, D)$, subject to the same β value, can also give the
same quality of classification.

Since more than one β-reduct may exist, a method of selecting an appropriate
(optimum) β-reduct needs to be considered.
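
The regions, the quality of classification and the two β-reduct properties can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the table layout (a list of dicts with decision attribute `d`), the function names, and the brute-force enumeration used to check property 2; exact fractions are used so that comparisons against β are not affected by rounding.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import combinations


def condition_classes(table, attrs):
    """EC(P): group objects that agree on all attributes in attrs."""
    classes = defaultdict(list)
    for obj in table:
        classes[tuple(obj[a] for a in attrs)].append(obj)
    return list(classes.values())


def regions(table, attrs, decision_value, beta, d="d"):
    """beta-positive, beta-negative and beta-boundary regions of one decision class."""
    pos, neg, bnd = [], [], []
    for X in condition_classes(table, attrs):
        p = Fraction(sum(obj[d] == decision_value for obj in X), len(X))
        if p >= beta:
            pos.append(X)
        elif p <= 1 - beta:
            neg.append(X)
        else:
            bnd.append(X)
    return pos, neg, bnd


def quality(table, attrs, beta, d="d"):
    """gamma^beta(P, D): share of objects in the beta-positive region of some class."""
    covered = 0
    for X in condition_classes(table, attrs):
        top = max(Counter(obj[d] for obj in X).values())
        if Fraction(top, len(X)) >= beta:  # beta > 0.5, so at most one class qualifies
            covered += len(X)
    return Fraction(covered, len(table))


def beta_reducts(table, attrs, beta, d="d"):
    """Minimal subsets of attrs giving the same quality as attrs (properties 1 and 2)."""
    target = quality(table, attrs, beta, d)
    found = []
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            if any(set(r) <= set(subset) for r in found):
                continue
            if quality(table, subset, beta, d) == target:
                found.append(subset)
    return found


# Hypothetical decision table with two condition attributes.
T = [
    {"c1": 0, "c2": 0, "d": 1},
    {"c1": 0, "c2": 1, "d": 1},
    {"c1": 0, "c2": 1, "d": 2},
    {"c1": 1, "c2": 0, "d": 2},
    {"c1": 1, "c2": 1, "d": 2},
]
beta = Fraction(3, 5)
```

On this toy table the quality obtained from {c1} alone (1) is higher than that obtained from both attributes (3/5), which illustrates that in VPRS the quality of classification need not grow with the number of attributes; the β-reduct is defined by equality with the quality for C, not by maximising it.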



^1 The VPRS model was extended by [3], [4] to incorporate asymmetric bounds on classification
probabilities. We restrict our attention here, without loss of generality, to the original VPRS.
3 Explanation of Data

To illustrate the findings (methods of β-reduct selection) in this paper, a real world
example is considered, namely an analysis of the alcohol related traffic fatalities
within the 50 US states plus the District of Columbia. More specifically the decision
attribute ($d_1$) is the estimated percent of alcohol related traffic fatalities.^2 Table 1 gives
the chosen intervals for $d_1$ and the number of states in each decision class.



Table 1. Intervals of $d_1$ defining the decision classes

Interval label               '1'          '2'            '3'
Interval, number of states   [0, 25), 6   [25, 34), 21   [34, 100], 24



To illustrate further the decision classes constructed, Fig. 1 shows geographically
the relative positions of the states in each decision class.

Fig. 1. Geographical illustration of decision classes

Seven condition attributes are used to characterize the motor vehicle accidents (see
Table 2) and relate to demographic, behavioural and socio-economic factors. They
were chosen based on their previous use in the related literature (see [10]).

Each of the condition attributes shown in Table 2 is continuous in nature. Since
VPRS requires the data to be in categorical form, the Minimum Class Entropy method
(see [2]) is used to discretize the data.^3 This is a local supervised discretization
method which considers a single condition attribute at a time and utilizes the decision
class values. This method requires the DM to choose the number of intervals to
categorize a condition attribute. In this study each condition attribute was discretized
into two intervals (labeled '0' and '1').

^2 That is, motor vehicle accidents which involved a person with a Blood Alcohol Concentration
level greater than or equal to 0.1 g/dl.

^3 This discretisation method has been previously used in [9].



Table 2. Descriptions of condition attributes^4

Attribute   Description          Function
c_1         Population density   Population (1,000's)/Size of state (sq. miles)
c_2         Youth Cohort         Percentage of state's population aged 15 to 24
c_3         Income               Median household income ($)
c_4         Education            Percentage of population high school graduate or higher
c_5         Speed                Proportion of fatalities speed related
c_6         Seat belt use        Percentage use of seat belts
c_7         Driver's Intensity   Vehicle Miles Travelled (miles)/Licensed Drivers



A visual representation of the boundary values associated with the intervals '0' and '1'
for each condition attribute is given in Fig. 2.

Fig. 2. Intervals (with boundary values) for the condition attributes (c_1, ..., c_7)



In Fig. 2, below each interval label the number of states whose continuous attribute
value lies within that interval is given. To illustrate, the attribute c_4 (Education) is
described by the intervals [76.4, 89.8) and [89.8, 92.0] which have 48 and 3 states
within each interval respectively. The result of the discretization on the condition
attributes is an information system denoted here as IS^Jg.



4 VPRS Analysis

The first stage of the VPRS analysis is to calculate the different levels of quality of
classification, i.e. $\gamma^\beta(C, D)$, which exist. There are four different levels of $\gamma^\beta(C, D)$
associated with the information system; these are given in Table 3 along with the
associated intervals of β values.

^4 Condition attribute data was found from the U.S. Department of Transportation National
Highway Safety Administration and the U.S. Census Bureau.



Table 3. Quality of classification levels and their associated β intervals

$\gamma^\beta(C, D)$   1              47/51            19/51            12/51
β interval             (0.5, 0.643)   [0.643, 0.667)   [0.667, 0.714)   [0.714, 1]

To illustrate the results in Table 3, $\gamma^\beta(C, D) = 1$ is only attained for a value of β in
the interval (0.5, 0.643). The next stage is to calculate all the β-reducts associated
with IS^Jg for the varying levels of $\gamma^\beta(C, D)$. Fig. 3 shows all the subsets of attributes
with their intervals of permissible β values making them β-reducts, based on the
criteria given in section 2 (and [12]).



Fig. 3. Visual representation of β-reducts associated with IS^Jg



From Fig. 3, the subset of attributes {c*,, c.^} is a β-reduct for a β value chosen in
the interval 0.5 to 0.512. There are 21 β-reducts associated with IS^Jg. Each different
level of $\gamma^\beta(C, D)$ has a number of associated β-reducts. For $\gamma^\beta(C, D) = 1$ there are 17
β-reducts. For the other three levels of $\gamma^\beta(C, D)$ there are only one or two associated
β-reducts. With all β-reducts identified, the most appropriate (optimum) β-reduct
needs to be selected to continue the analysis. This would require knowledge of what
the DM believes are satisfactory levels of $\gamma^\beta(C, D)$ and β value. If a β level is known,
a vertical line can be drawn (using Fig. 3) for that β value and any β-reduct can be
chosen for which part (or all) of its permissible β interval lies to the right of the
vertical line.^5 This is since a β value is implicitly considered the lower bound on
allowed confidence.

In this study no information is given on a satisfactory β value; only a satisfactory
level of $\gamma^\beta(C, D)$ is known. Hence only those β-reducts are considered which lie in
levels of $\gamma^\beta(C, D)$ greater than or equal to the satisfactory level given. Under this
consideration we offer two methods for the selection of β-reducts.



4.1 Method 1: Largest Value 

The first method chooses a β-reduct which has the highest permissible value of β
associated with it from within the intervals of permissible β given by the satisfactory
levels of $\gamma^\beta(C, D)$. To illustrate, in the case of $\gamma^\beta(C, D) = 1$ required, the β interval is
(0.5, 0.643). Hence a β-reduct with a permissible β value nearest to 0.643 will be
chosen. In this case there is only one β-reduct (see Fig. 3), i.e. c^} would be
chosen since its upper bound on permissible β is also 0.643.

Using this method there may be more than one β-reduct with the same
largest permissible β value. In this case other selection criteria will need to be used,
such as the least number of attributes. At first sight this would seem an appropriate
method to find the optimum β-reduct. However, this method is biased towards
obtaining a β-reduct associated with the $\gamma^\beta(C, D)$ level nearest to the minimum level
of satisfactory $\gamma^\beta(C, D)$ (a direct consequence of requiring the largest β value possible).
Where there are a number of different levels of $\gamma^\beta(C, D)$ within which to find an
optimum β-reduct, this method allows little possibility for a β-reduct to be chosen
from the higher $\gamma^\beta(C, D)$ levels.



4.2 Method 2: Most Similar Interval 



The second method looks for a β-reduct whose interval of permissible β values is
most similar to the β interval for the $\gamma^\beta(C, D)$ level it is associated with. To illustrate,
in the case $\gamma^\beta(C, D) = 1$, with β ∈ (0.5, 0.643), the β-reduct {c^} (from Fig. 3) has a
permissible interval of β equal to (0.5, 0.619). This interval is most similar to the
interval (0.5, 0.643) associated with $\gamma^\beta(C, D) = 1$. Using this method, the
consideration is at a more general level, i.e. the chosen β-reduct is offering the most
similar information content to C. The measure of similarity is not an absolute value
but a proportion of the interval of the particular level of $\gamma^\beta(C, D)$.

To be more formal, define $\underline{\beta}_{C,\gamma}$ and $\overline{\beta}_{C,\gamma}$ to be the lower and upper
bounds on the β interval for a level of quality of classification γ (i.e. γ^β(C, D))
based on C. A similar set of boundary points exists for each β-reduct (say R), i.e.
$\underline{\beta}_{R,\gamma}$ and $\overline{\beta}_{R,\gamma}$. It follows that the respective ranges of the β intervals
can be constructed; for γ^β(C, D) it is

$$|\beta_{C,\gamma}| = \overline{\beta}_{C,\gamma} - \underline{\beta}_{C,\gamma}$$

Hence the optimum β-reduct (R) to choose, based on a minimum level of quality of
classification required, is the R for which

$$|\beta_{R,\gamma}| \,/\, |\beta_{C,\gamma}|$$

is the largest, with γ no lower than the minimum level required. This value represents
the proportion of similarity in β interval sizes between a given β-reduct R and the
γ^β(C, D) level it is associated with. More than one β-reduct may be found using this
method, in the same or different γ^β(C, D) levels.

¹ β intervals have been considered before in [11] when defining absolute and relatively
rough condition classes.



Considering only γ^β(C, D) = 1 demonstrates that these two methods can produce different
results. Indeed, the two measures can be combined to choose a β-reduct.
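
The two selection rules can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the labels `R_a` and `R_b` are hypothetical stand-ins for the two β-reducts at γ^β(C, D) = 1, the interval (0.5, 0.619) and the upper bound 0.643 come from the text, and the lower bound 0.619 of the second reduct is assumed for the sake of the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BetaReduct:
    name: str      # hypothetical label for the attribute subset
    level: float   # quality of classification level the reduct belongs to
    lower: float   # lower bound of its permissible beta interval
    upper: float   # upper bound of its permissible beta interval

def method1(reducts, min_level):
    """Method 1: keep the reduct(s) with the largest permissible beta value."""
    candidates = [r for r in reducts if r.level >= min_level]
    best = max(r.upper for r in candidates)
    return [r.name for r in candidates if r.upper == best]

def similarity(reduct, level_intervals):
    """Width of the reduct's interval as a proportion of its level's interval."""
    lo, hi = level_intervals[reduct.level]
    return (reduct.upper - reduct.lower) / (hi - lo)

def method2(reducts, level_intervals, min_level):
    """Method 2: keep the reduct(s) whose interval is most similar to the level's."""
    candidates = [r for r in reducts if r.level >= min_level]
    best = max(similarity(r, level_intervals) for r in candidates)
    return [r.name for r in candidates
            if abs(similarity(r, level_intervals) - best) < 1e-12]

# Level gamma = 1 has beta interval (0.5, 0.643), as stated in the text.
level_intervals = {1.0: (0.5, 0.643)}
reducts = [
    BetaReduct("R_a", 1.0, 0.5, 0.619),    # interval quoted in the text
    BetaReduct("R_b", 1.0, 0.619, 0.643),  # lower bound 0.619 is an assumption
]

print(method1(reducts, 1.0))                   # ['R_b']
print(method2(reducts, level_intervals, 1.0))  # ['R_a']
```

Under these assumed intervals the two methods disagree, which is the behaviour the paragraph above describes for γ^β(C, D) = 1.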



5 An Example of β-Reduct Selection Method 2

Using the study described in Section 3, suppose for example that the DM's only
requirement is for the VPRS analysis to give a classification to at least 66.667% of the
states. With at least 34 states needing to be given a classification, the optimum
β-reduct can be found from the largest two levels of γ^β(C, D), i.e. those satisfying
γ^β(C, D) ≥ 34/51. Inspection of the β-reducts with respect to the similarity measure
given in Method 2 indicates that {c1, c2, c3, c5, c7} (say R) is the optimum β-reduct.
This is because it has the same permissible β interval as γ^β(C, D) = 37/51, i.e.
|β_{R,γ}| / |β_{C,γ}| = 1. No other β-reduct associated with γ^β(C, D) = 1 or
γ^β(C, D) = 37/51 has this level of similarity.

Following the rule construction method given in [1], the minimal set of rules for the
β-reduct {c1, c2, c3, c5, c7} is constructed; see Table 4. It shows there are 6 rules to
classify the 37 states.



Table 4. Minimal set of rules associated with β-reduct {c1, c2, c3, c5, c7}

Rule       c1   c2   c3   c5   c7          d   Strength   Correct   Proportion
1     If    -    1    -    -    -   then   1       1          1       1
2     If    1    -    0    -    1   then   2       7          5       0.714
3     If    -    0    -    0    -   then   2       6          4       0.667
4     If    -    -    1    1    -   then   3      16         12       0.75
5     If    0    -    -    -    -   then   3       6          6       1
6     If    -    -    -    -    0   then   3       1          1       1




In Table 4 the strength column relates to how many states the rule classifies (correctly
or incorrectly), and hence indicates that 37 states are given a classification. The correct
column indicates how many states are correctly classified by the rule; it totals 29,
indicating that 78.378% of those states given a classification were correctly classified.
The final column gives the proportion of states a rule correctly classified. Fig. 4 gives a
geographical illustration of the classification (and non-classification) of all the states
by the rules given in Table 4.
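
The totals quoted for Table 4 can be checked with a few lines of Python. This is only an arithmetic check on the strength and correct columns as printed; the variable names are illustrative.

```python
# (rule number, strength, correct) as listed in Table 4
rules = [(1, 1, 1), (2, 7, 5), (3, 6, 4), (4, 16, 12), (5, 6, 6), (6, 1, 1)]

total_strength = sum(strength for _, strength, _ in rules)   # states classified
total_correct = sum(correct for _, _, correct in rules)      # states classified correctly

# Proportion column: correct / strength for each rule, rounded to 3 decimals
proportions = {rule: round(correct / strength, 3) for rule, strength, correct in rules}

# Share of classified states that were classified correctly, in percent
accuracy_pct = round(100 * total_correct / total_strength, 3)

# Share of all 51 states that received a classification
coverage = total_strength / 51

print(total_strength, total_correct, accuracy_pct)  # 37 29 78.378
print(proportions)
```

The coverage 37/51 is about 72.5%, which is above the 66.667% the DM required in this example.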







Fig. 4. Geographical representation of classification of states by rules in Table 4 

In Fig. 4 each state which was given a classification is labelled with the rule it is
classified by. Further analysis of the rules and their classifications in this study is
beyond the focus of this paper. An interesting point to note is that the states
Utah (UT) and District of Columbia (DC) are each classified by their own rule (rules
1 and 6 respectively). This follows the reasoning in related literature, which considers
these to be outlier states due to their social/behavioral (UT) and demographic (DC)
properties.

This study has illustrated the automation of the β-reduct selection process within
VPRS analysis. The analysis of permissible intervals of β has enabled a more
generalised process of β-reduct selection. However, since finding reducts is an
NP-complete problem in RST [8], it follows that the problem is more complicated in VPRS.
Further aspects using the principle of permissible β intervals are defined below, which
may help in efficiently finding the optimum β-reduct in VPRS analysis.

- Spanning set of β-reducts: A spanning set of β-reducts is the least number of
β-reducts whose permissible β intervals cover the full interval of β values (without
overlapping) associated with a particular level of γ^β(C, D). In the case of the
information system studied here, when γ^β(C, D) = 1, the β-reducts {c^} and {c^, c^}
form such a spanning set. There may exist more than one spanning set for the same
level of quality of classification.

- Root attributes: The core of a spanning set of β-reducts forms the root. For the
information system studied here, when γ^β(C, D) = 1, the set of attributes {c^} is a root.

- Onto β: This definition relates to a single quality of classification or to the
information system as a whole. If it is shown that a β-reduct exists for any allowable β
considered, then the system is onto β. This onto principle may affect any discretisation
of continuous attributes, since a prerequisite may be for the discretisation to make
sure the information system is onto β. That is, the DM is assured they can choose
any β value in their analysis and be able to find a β-reduct.

6 Conclusions

An important aspect of this paper is an understanding of the interval of permissible β
values, which affects the different levels of quality of classification and whether a
subset of attributes is a β-reduct. This understanding is of particular importance when
no knowledge of a specific satisfactory β value is given by the decision maker.

While it may be more favorable for the DM to provide a β value, this paper
introduces two methods of selecting a β-reduct without such a known β value.
Method 2 in particular finds the optimum β-reduct which offers an information
content most similar to that of the whole set of attributes.

By relaxing some of the a priori information a DM needs to consider before VPRS
analysis can be undertaken, it is hoped the methods discussed in this paper may
contribute to VPRS being used more extensively in the future.
Abstract. We provide a unified methodology for searching for approximate
decision reducts based on rough membership distributions. The presented study
generalizes the well-known relationships between rough set reducts and
boolean prime implicants.



1 Introduction 

The notion of a decision reduct was developed within the rough set theory ([2])
to deal with subsets of features appropriate for the description and classification
of cases within a given universe. In view of applications, the problem of
finding minimal subsets (approximately) determining a specified decision attribute
turned out to be crucial. Comparable to the problem of finding minimal
(approximate) prime implicants for boolean functions ([1]), it was proved to be
NP-hard ([8]). On the other hand, the relationship to the boolean calculus provided
by the discernibility representation made it possible to develop efficient heuristics
for finding approximately optimal solutions (cf. [6], [7]).

In recent years, various approaches to approximating and generalizing the
decision reduct criteria were developed (cf. [6], [9]). An important issue here is to
adapt the original methodology to be able to deal with indeterminism (inconsistency)
in data in a flexible way. Based on the discernibility characteristics, one can say
that a reduct should be understood as an irreducible subset of conditional features
which discerns all pairs of cases behaving too differently with respect to a
pre-assumed way of understanding inexact decision information. Another approach
is to think about a reduct as a subset which approximately preserves the initial
information induced by the whole set of attributes.

In this paper we focus on frequency based tools for modeling inconsistencies
(cf. [3], [4], [9], [10]). It is worth emphasizing that the introduced notion of an
(approximate) frequency decision reduct is analogous to the notion of
a Markov boundary, which is crucial for many applications of statistics and the
theory of probability (cf. [5]).

In Section 2 we outline basic notions of the rough set based approach to data
mining. In Section 3 we recall the relationship between the notion of a rough
set decision reduct and a boolean prime implicant. Section 4 contains basic
facts concerning the frequency based approach related to the notion of a rough
membership function. In Section 5 we generalize the notion of a μ-decision reduct
onto the parameterized class of distance based approximations. In Section 6 we
illustrate the process of extracting approximate frequency discernibility tables.

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 123-130, 2001.

2 Decision Tables and Reducts 

In the rough set theory ([2]), a sample of data takes the form of an information
system $\mathbb{A} = (U, A)$, where each attribute $a \in A$ is identified with a function
$a : U \to V_a$ from the universe of objects $U$ into the set $V_a$ of all possible values of
$a$. Reasoning about data can be stated as, e.g., a classification problem, where
the values of a specified decision attribute are to be predicted under information
over conditions. In this case, we consider a triple $\mathbb{A} = (U, A, d)$, called a decision
table, where, for the decision attribute $d \notin A$, values $v_d \in V_d$ correspond to
mutually disjoint decision classes of objects.

Definition 1. Let $\mathbb{A} = (U, A, d)$ and ordering $A = (a_1, \ldots, a_{|A|})$ be given. For
any $B \subseteq A$, the B-ordered information function over $U$ is defined by

$$\mathrm{Inf}_B(u) = (a_{i_1}(u), \ldots, a_{i_{|B|}}(u)) \tag{1}$$

The B-indiscernibility relation is the equivalence relation defined by

$$IND_{\mathbb{A}}(B) = \{(u, u') \in U \times U : \mathrm{Inf}_B(u) = \mathrm{Inf}_B(u')\} \tag{2}$$

Each $u \in U$ induces a B-indiscernibility class of the form

$$[u]_B = \{u' \in U : (u, u') \in IND_{\mathbb{A}}(B)\} \tag{3}$$

which can be identified with vector $\mathrm{Inf}_B(u)$.

Indiscernibility enables us to express global dependencies among attributes:

Definition 2. Let $\mathbb{A} = (U, A, d)$ be given. We say that $B \subseteq A$ defines $d$ in $\mathbb{A}$ iff

$$IND_{\mathbb{A}}(B) \subseteq IND_{\mathbb{A}}(\{d\}) \tag{4}$$

or, equivalently, iff for any $u \in U$, $\mathbb{A}$ satisfies the object oriented rule of the form

$$\bigwedge_{a \in B} (a = a(u)) \Rightarrow (d = d(u)) \tag{5}$$

We say that $B \subseteq A$ is a decision reduct iff it defines $d$ and none of its proper
subsets does.

Given $B \subseteq A$ which defines $d$, we can classify any new case $u_{new} \notin U$ by decision
rules of the form (5). The only requirement is that $\mathbb{A}$ must recognize $u_{new}$ with
respect to $B$, i.e., the combination of values observed for $u_{new}$ must fit vector
$\mathrm{Inf}_B(u)$ for some $u \in U$. The expected degree of recognition of new cases is the
reason for searching for (approximate) decision reducts which are of minimal
complexity, understood in various ways (cf. [6], [7], [8], [9]).
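
Definitions 1 and 2 can be made concrete with a small Python sketch. It is a brute-force illustration only: the four-object decision table and all function names are hypothetical, and the exhaustive subset search is exponential in the number of attributes, so it is not one of the heuristics the paper refers to.

```python
from itertools import combinations

def indiscernibility_classes(table, attrs):
    """Group objects by their information vector on attrs (Definition 1)."""
    classes = {}
    for obj, row in table.items():
        classes.setdefault(tuple(row[a] for a in attrs), []).append(obj)
    return classes

def defines(table, decision, attrs):
    """True iff objects indiscernible on attrs always share a decision (Definition 2)."""
    return all(len({decision[obj] for obj in cls}) == 1
               for cls in indiscernibility_classes(table, attrs).values())

def decision_reducts(table, decision, all_attrs):
    """Subsets that define d and contain no smaller subset defining d."""
    found = []
    for size in range(len(all_attrs) + 1):          # smallest subsets first
        for subset in combinations(all_attrs, size):
            if any(set(r) <= set(subset) for r in found):
                continue                             # a proper subset already works
            if defines(table, decision, subset):
                found.append(subset)
    return found

# Hypothetical consistent decision table with conditions a, b, c
table = {
    "u1": {"a": 0, "b": 0, "c": 1},
    "u2": {"a": 0, "b": 1, "c": 1},
    "u3": {"a": 1, "b": 0, "c": 0},
    "u4": {"a": 1, "b": 1, "c": 0},
}
decision = {"u1": 0, "u2": 1, "u3": 1, "u4": 1}

print(decision_reducts(table, decision, ("a", "b", "c")))  # [('a', 'b'), ('b', 'c')]
```

Because subsets are visited in order of increasing size and supersets of already found reducts are skipped, every subset that is returned is irreducible.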
3 Relationships with Boolean Reasoning

Results provided by the rough set literature state that the problems of finding
minimal (minimally complex) decision reducts are NP-hard (cf. [6], [8], [9]).
This encourages the development of various methods for the effective search for
(almost) optimal attribute subsets. These methods are often based on analogies to
optimization problems known from other fields of science. For instance, let us
consider the following relationship:

Proposition 1. ([8]) Let $\mathbb{A} = (U, A, d)$ be given. The set of all decision reducts
for $\mathbb{A}$ is equivalent to the set of all prime implicants of the boolean discernibility
function

$$f_{\mathbb{A}}(\bar{a}_1, \ldots, \bar{a}_{|A|}) = \bigwedge_{i,j \,:\, c_{ij} \neq \emptyset} \; \bigvee_{a \in c_{ij}} \bar{a} \tag{6}$$

where variables $\bar{a}$ correspond to particular attributes $a \in A$, and where for any
$i, j = 1, \ldots, |U|$

$$c_{ij} = \begin{cases} \{a \in A : a(u_i) \neq a(u_j)\} & \text{if } d(u_i) \neq d(u_j) \\ \emptyset & \text{otherwise} \end{cases} \tag{7}$$

The above result makes it possible to adapt heuristics approximating the solutions of
well-known problems of boolean calculus (cf. [1]) to the tasks concerning decision
reducts (cf. [7], [8]). Obviously, the size of appropriately specified boolean
discernibility functions crucially influences the efficiency of the adopted algorithms.

Definition 3. Let $\mathbb{A} = (U, A, d)$ be given. By the discernibility table for $\mathbb{A}$ we
mean the collection of attribute subsets defined by

$$T_{\mathbb{A}} = \{T \subseteq A : T \neq \emptyset \wedge \exists_{i,j}\,(T = c_{ij})\} \tag{8}$$

To obtain better compression of the discernibility function, one can apply the
absorption law related to the following characteristics.

Definition 4. Let $\mathbb{A} = (U, A, d)$ be given. By the reduced discernibility table for
$\mathbb{A}$ we mean the collection of attribute subsets defined by

$$\overline{T}_{\mathbb{A}} = \{T \in T_{\mathbb{A}} : \neg \exists_{T' \in T_{\mathbb{A}}}\,(T' \subsetneq T)\} \tag{9}$$

Proposition 2. Let $\mathbb{A} = (U, A, d)$ be given. Subset $B \subseteq A$ defines $d$ in $\mathbb{A}$ iff it
intersects with each element of the reduced discernibility table $\overline{T}_{\mathbb{A}}$ or,
equivalently, iff it corresponds to an implicant of the boolean function

$$g_{\mathbb{A}}(\bar{a}_1, \ldots, \bar{a}_{|A|}) = \bigwedge_{T \in \overline{T}_{\mathbb{A}}} \; \bigvee_{a \in T} \bar{a} \tag{10}$$

Experience concerning the performance of boolean-like calculations over
discernibility structures (cf. [6], [7], [8]) suggests paying special attention to the
possibility of re-formulating the above relationship for other types of reducts.
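
As a rough illustration of Definitions 3 and 4 and of Proposition 2, the following Python sketch builds the discernibility table of a small, hypothetical consistent decision table, applies absorption, and tests whether an attribute subset intersects every remaining element. Attributes are referred to by column index; all names are illustrative assumptions.

```python
from itertools import combinations

def discernibility_table(rows, decisions):
    """Non-empty sets of attribute indices discerning objects with different decisions."""
    table = set()
    for i, j in combinations(range(len(rows)), 2):
        if decisions[i] != decisions[j]:
            c_ij = frozenset(a for a, (x, y) in enumerate(zip(rows[i], rows[j])) if x != y)
            if c_ij:
                table.add(c_ij)
    return table

def absorb(table):
    """Reduced table: drop every element that has a proper subset in the table."""
    return {t for t in table if not any(other < t for other in table)}

def intersects_all(subset, table):
    """Criterion of Proposition 2: the subset meets each element of the table."""
    return all(subset & t for t in table)

# Hypothetical decision table: columns 0, 1, 2 are the conditional attributes
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
decisions = [0, 1, 1, 1]

full = discernibility_table(rows, decisions)
reduced = absorb(full)
print(sorted(map(sorted, full)))     # [[0, 1, 2], [0, 2], [1]]
print(sorted(map(sorted, reduced)))  # [[0, 2], [1]]
print(intersects_all({0, 1}, reduced), intersects_all({0, 2}, reduced))  # True False
```

In this toy table the subset {0, 1} meets both elements of the reduced table and therefore defines the decision, while {0, 2} misses the element {1} and does not.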
4 Rough Membership Distributions

In applications, we often deal with inconsistent decision tables $\mathbb{A} = (U, A, d)$,
where there is no possibility of covering the whole universe by exact
decision rules of the form (5). In case of such a lack of complete conditional
specification of decision classes, one has to rely on a kind of representation of
the initial inconsistency, to be able to measure its dynamics with respect to
feature reduction. We would like to focus on the approach proposed originally
in [4], resulting from adopting the frequency based calculus to rough sets.

Definition 5. Let $\mathbb{A} = (U, A, d)$, linear ordering $V_d = (v_1, \ldots, v_r)$, $r = |V_d|$,
and $B \subseteq A$ be given. We call a B-rough membership distribution the function
$\mu_{d/B} : U \to \Delta_{r-1}$ defined by²

$$\mu_{d/B}(u) = (\mu_{d=1/B}(u), \ldots, \mu_{d=r/B}(u)) \tag{11}$$

where, for $k = 1, \ldots, r$, $\mu_{d=k/B}(u) = |\{u' \in [u]_B : d(u') = v_k\}| \,/\, |[u]_B|$ is the
rough membership function (cf. [3], [4], [9], [10]) labeling $u \in U$ with the degree
of hitting the k-th decision class with its B-indiscernibility class.

The following is a straightforward generalization of Definition 2:

Definition 6. Let $\mathbb{A} = (U, A, d)$ be given. We say that $B \subseteq A$ μ-defines $d$ in $\mathbb{A}$
iff for each $u \in U$ we have

$$\mu_{d/B}(u) = \mu_{d/A}(u) \tag{12}$$

We say that $B \subseteq A$ is a μ-decision reduct for $\mathbb{A}$ iff it μ-defines $d$ and none of
its proper subsets does.

Rough membership distributions can be regarded as a frequency based source of
statistical estimation of the joint probabilistic distribution over the space of random
variables corresponding to $A \cup \{d\}$. From this point of view, the above notion is
closely related to the theory of probabilistic conditional independence (cf. [5]):
Given $\mathbb{A} = (U, A, d)$, subset $B \subseteq A$ is a μ-decision reduct for $\mathbb{A}$ iff it is a Markov
boundary of $d$ with respect to $A$, i.e., iff it is an irreducible subset which makes
$d$ probabilistically independent of the rest of $A$. This analogy is important for
applications of both rough set and statistical techniques of data analysis.
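
A minimal Python sketch of Definitions 5 and 6 follows. The inconsistent six-object table is hypothetical, attributes are addressed by column index, and exact fractions are used so that the equality test of Definition 6 is not affected by rounding.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def rough_membership(rows, decisions, attrs, values):
    """Map each object index to its decision frequency vector within its B-class."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    mu = {}
    for members in classes.values():
        counts = Counter(decisions[i] for i in members)
        dist = tuple(Fraction(counts[v], len(members)) for v in values)
        for i in members:
            mu[i] = dist
    return mu

def mu_defines(rows, decisions, attrs, all_attrs, values):
    """True iff the B-distribution equals the A-distribution for every object."""
    return (rough_membership(rows, decisions, attrs, values)
            == rough_membership(rows, decisions, all_attrs, values))

# Hypothetical inconsistent table: objects 0/1 and 2/3 agree on A but not on d
rows = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 1)]
decisions = [1, 2, 1, 2, 1, 1]
values = (1, 2)        # ordering of decision values
all_attrs = (0, 1)

print(rough_membership(rows, decisions, all_attrs, values)[0])  # (Fraction(1, 2), Fraction(1, 2))
print(mu_defines(rows, decisions, (0,), all_attrs, values))     # True
print(mu_defines(rows, decisions, (1,), all_attrs, values))     # False
```

Here the first attribute alone preserves every frequency vector, while the empty subset does not, so {0} is a μ-decision reduct of this toy table.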

Proposition 3. (cf. [9]) Let $\mathbb{A} = (U, A, d)$ be given. Subset $B \subseteq A$ μ-defines $d$
iff it intersects with each element of the μ-discernibility table defined by

$$T^{\mu}_{\mathbb{A}} = \{T \subseteq A : T \neq \emptyset \wedge \exists_{i,j}\,(T = c^{\mu}_{ij})\} \tag{13}$$

where

$$c^{\mu}_{ij} = \begin{cases} \{a \in A : a(u_i) \neq a(u_j)\} & \text{if } \mu_{d/A}(u_i) \neq \mu_{d/A}(u_j) \\ \emptyset & \text{otherwise} \end{cases} \tag{14}$$

Proposition 3 relates the task of searching for optimal μ-decision reducts to the
procedure of finding minimal prime implicants, just as in the exact case before.
Such a relationship enables us to design efficient algorithms approximately
solving the NP-hard problem of extracting minimal Markov boundaries.

² For any $r \in \mathbb{N}$, we denote by $\Delta_{r-1}$ the (r-1)-dimensional simplex of real valued
vectors $s = (s[1], \ldots, s[r])$ with non-negative coordinates, such that $\sum_{k=1}^{r} s[k] = 1$.

5 Distance Based Approximations of Rough Memberships



Rough membership information is highly detailed, which is especially useful if other
types of inconsistency representation turn out to be too vague for given data. On the
other hand, it is too accurate to handle dynamical changes or noise in data
efficiently. To provide a more flexible framework for attribute reduction, a
relaxation of the criteria for being a μ-decision reduct is needed. The real valued
specificity of frequencies enables us to introduce a whole class of intuitive
approximations parameterized by the choice of: (1) the way of measuring the
distance between distributions, and (2) thresholds up to which we agree to regard
close states as practically indistinguishable:

Definition 7. (cf. [9]) Let $r \in \mathbb{N}$ be given. We say that $\rho : \Delta_{r-1}^2 \to [0, 1]$ is a
normalized distance measure iff for each $s, s', s'' \in \Delta_{r-1}$ we have

$$\begin{aligned}
\rho(s, s') = 0 &\Leftrightarrow s = s' & \rho(s, s'') &\le \rho(s, s') + \rho(s', s'') \\
\rho(s, s') &= \rho(s', s) & \rho(s, s') = 1 &\Leftrightarrow \exists_{k \neq l}\,(s[k] = s'[l] = 1)
\end{aligned} \tag{15}$$

Definition 8. (cf. [9]) Let $\mathbb{A} = (U, A, d)$, $\rho : \Delta_{r-1}^2 \to [0, 1]$, $r = |V_d|$ and
$\varepsilon \in [0, 1)$ be given. We say that $B \subseteq A$ $(\rho, \varepsilon)$-approximately μ-defines $d$ iff for
any $u \in U$

$$\rho(\mu_{d/B}(u), \mu_{d/A}(u)) \le \varepsilon \tag{16}$$

We say that $B \subseteq A$ is a $(\rho, \varepsilon)$-approximate μ-decision reduct iff it μ-defines $d$
$(\rho, \varepsilon)$-approximately and none of its proper subsets does.

Proposition 4. (cf. [9]) Let $\rho : \Delta_{r-1}^2 \to [0, 1]$ satisfying (15) and $\varepsilon \in [0, 1)$ be
given. Then, the problem of finding a minimal $(\rho, \varepsilon)$-approximate μ-decision reduct
is NP-hard.

The above result states that for any reasonable way of approximating the conditions
of Definition 5 we cannot avoid the potentially high computational complexity of the
optimal feature reduction process. Still, the variety of possible choices of
approximation thresholds and distances enables us to fit data better, by appropriate
tuning. In practice it is handier to operate with an easily parameterized
class of normalized distance measures. Let us consider the following:

Definition 9. Let $x \in [1, +\infty)$ and $r \in \mathbb{N}$ be given. The normalized x-distance
measure is the function $\rho_x : \Delta_{r-1}^2 \to [0, 1]$ defined by the formula

$$\rho_x(s, s') = \left( \frac{1}{2} \sum_{k=1}^{r} \left| s[k] - s'[k] \right|^{x} \right)^{1/x} \tag{17}$$

One can see that for any $x \in [1, +\infty)$, function (17) satisfies conditions (15).
The crucial property of $(x, \varepsilon)$-approximations is the following:
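
The normalized x-distance can be sketched in Python as follows. The sketch assumes the factor 1/2 inside the root, which is the normalization under which two distinct vertices of the simplex are at distance exactly 1, as conditions (15) require; the function name is illustrative.

```python
def x_distance(s, t, x=2.0):
    """Normalized x-distance between two frequency vectors of equal length.

    Assumes the factor 1/2 inside the root, so that two distinct vertices
    of the simplex are at distance exactly 1.
    """
    return (0.5 * sum(abs(a - b) ** x for a, b in zip(s, t))) ** (1.0 / x)

e1, e2 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
s, t = (0.1, 0.5, 0.4), (0.8, 0.1, 0.1)

print(x_distance(e1, e2))            # 1.0, distinct vertices of the simplex
print(x_distance(s, s))              # 0.0, identical distributions
print(round(x_distance(s, t), 3))    # 0.608
```

For x = 2 this is the Euclidean distance divided by the square root of 2; for x = 1 it is half of the sum of absolute differences.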
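Proposition 5. Let $\mathbb{A} = (U, A, d)$, $x \in [1, +\infty)$, $\varepsilon \in [0, 1)$ and $B \subseteq A$ be given.
If $B$ intersects with each element of the $(x, \varepsilon)$-discernibility table

$$T^{x,\varepsilon}_{\mathbb{A}} = \{T \subseteq A : T \neq \emptyset \wedge \exists_{i,j}\,(T = c^{x,\varepsilon}_{ij})\} \tag{18}$$

where

$$c^{x,\varepsilon}_{ij} = \begin{cases} \{a \in A : a(u_i) \neq a(u_j)\} & \text{if } \rho_x(\mu_{d/A}(u_i), \mu_{d/A}(u_j)) > \varepsilon \\ \emptyset & \text{otherwise} \end{cases} \tag{19}$$

then it $(x, \varepsilon)$-approximately μ-defines $d$ in $\mathbb{A}$. If $B$ does not intersect with some
of the elements of the above table, then it cannot $(x, \varepsilon')$-approximately μ-define $d$,
for any $\varepsilon' \le \varepsilon/2$.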

6 Examples of Discernibility Tables 

Proposition 5 provides us with a unified methodology for calculating approximate
distance based reducts. We can keep using the procedure introduced for
non-approximate decision reducts:

- For $\varepsilon \in [0, 1)$, $x \in [1, +\infty)$, construct the reduced (by absorption)
$(x, \varepsilon)$-discernibility table.

- Find prime implicants for the corresponding $(x, \varepsilon)$-discernibility function.

This leads to the conclusion that one can apply well-known discernibility based
algorithms for decision reduct optimization (cf. [6]) to searching for various
types of approximate reducts. Moreover, an appropriate choice of approximation
parameters can speed up calculations by reducing the size of the discernibility
structure.

For an illustration, let us consider the exemplary decision table in Fig. 1. Since
it is enough to focus on $(x, \varepsilon)$-discernibility sets over pairs of objects discernible
by $A$, we present our table in the probabilistic form (cf. [10]), where each record
corresponds to an element $u_i^* \in U/A$ of the set of $IND_{\mathbb{A}}(A)$-classes.



U/A     |[u*_i]_A|   a1   a2   a3   a4   a5   μ_{d=1/A}(u*_i)   μ_{d=2/A}(u*_i)   μ_{d=3/A}(u*_i)
u*_1       10         1    1    0    1    2        0.1               0.5               0.4
u*_2       10         2    1    1    0    2        1.0               0.0               0.0
u*_3       10         2    2    2    1    1        0.2               0.2               0.6
u*_4       10         0    1    2    2    2        0.8               0.1               0.1
u*_5       10         0    0    0    2    2        0.4               0.2               0.4
u*_6       10         1    2    0    0    2        0.1               0.2               0.7

Fig. 1. The probabilistic table of A-indiscernibility classes (6 classes) labeled with
their: (1) object supports (each supported by 10 objects); (2) A-ordered information
vectors (5 conditional attributes); (3) μ-decision distributions (3 decision classes).



In Fig. 2 we present the sizes of reduced $(x, \varepsilon)$-discernibility tables, obtained for
constant $x = 2$ under different choices of $\varepsilon \in [0, 1)$. The applied procedure of
their extraction looks as follows:

- Within the loop over $1 \le i < j \le |U/A|$, find pairs $(u_i^*, u_j^*)$ corresponding to
distributions remaining too far from each other in terms of the inequality

$$\rho_x(\mu_{d/A}(u_i^*), \mu_{d/A}(u_j^*)) > \varepsilon \tag{20}$$

- Simultaneously store the corresponding $(x, \varepsilon)$-discernibility sets $c^{x,\varepsilon}_{ij} \subseteq A$ in a
temporary discernibility table, under an online application of absorption.

Fig. 2 also contains basic facts concerning exemplary attribute subsets obtained
by an application of simple, exemplary heuristics for searching for prime
implicants. One can see that these subsets do satisfy the conditions of Definition 8
for particular settings.
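
The extraction procedure can be replayed on the data of Fig. 1 with the Python sketch below. Several points are assumptions rather than the authors' implementation: the distance uses the factor 1/2 inside the root, the threshold test is strict with a small numerical tolerance, and prime implicants are found by exhaustive search over attribute subsets instead of a heuristic.

```python
from itertools import combinations

# Rows of Fig. 1: (values of a1..a5, decision distribution) for the six A-classes
CLASSES = [
    ((1, 1, 0, 1, 2), (0.1, 0.5, 0.4)),
    ((2, 1, 1, 0, 2), (1.0, 0.0, 0.0)),
    ((2, 2, 2, 1, 1), (0.2, 0.2, 0.6)),
    ((0, 1, 2, 2, 2), (0.8, 0.1, 0.1)),
    ((0, 0, 0, 2, 2), (0.4, 0.2, 0.4)),
    ((1, 2, 0, 0, 2), (0.1, 0.2, 0.7)),
]
TOL = 1e-9  # guards the strict comparison against floating-point rounding

def x_distance(s, t, x):
    """Normalized x-distance, assuming the factor 1/2 inside the root."""
    return (0.5 * sum(abs(a - b) ** x for a, b in zip(s, t))) ** (1.0 / x)

def reduced_table(eps, x=2.0):
    """Count pairs to be discerned and build the table with online absorption."""
    pairs, table = 0, set()
    for (vi, mi), (vj, mj) in combinations(CLASSES, 2):
        if x_distance(mi, mj, x) > eps + TOL:
            pairs += 1
            c = frozenset(k + 1 for k in range(5) if vi[k] != vj[k])
            if any(t <= c for t in table):
                continue                                  # c is absorbed
            table = {t for t in table if not c < t} | {c}  # c absorbs supersets
    return pairs, table

def prime_implicants(table, n_attrs=5):
    """Minimal attribute subsets intersecting every element (exhaustive search)."""
    found = []
    for size in range(1, n_attrs + 1):
        for cand in combinations(range(1, n_attrs + 1), size):
            s = set(cand)
            if all(s & t for t in table) and not any(set(f) <= s for f in found):
                found.append(cand)
    return found

for eps in (0.0, 0.2, 0.4, 0.6, 0.8):
    pairs, table = reduced_table(eps)
    impls = prime_implicants(table)
    print(eps, pairs, len(table), len(impls), impls)
```

Under these assumptions the sketch yields 15, 12, 7, 5 and 1 pairs for ε = 0, 0.2, 0.4, 0.6 and 0.8, with 3, 3, 4, 3 and 1 table elements and 4, 4, 8, 5 and 3 prime implicants, which agrees with the values reported in Fig. 2.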



x    ε     # pairs   # elts.   # impls.   avg.   found implicants
2    0       15         3         4        2     {1,2}, {2,3}, {2,4}, {3,4}
2    0.2     12         3         4        2     {1,2}, {2,3}, {2,4}, {3,4}
2    0.4      7         4         8        2     {1,2}, {1,3}, {1,4}, {1,5}, {2,3}, {2,4}, {3,4}, {3,5}
2    0.6      5         3         5        1.8   {1,2}, {1,4}, {1,5}, {2,4}, {3}
2    0.8      1         1         3        1     {1}, {2}, {3}

Fig. 2. Absorption-optimized discernibility tables obtained for the above exemplary
decision table, under various thresholds and fixed x = 2, where: (1-2) the first two
columns refer to (x, ε)-settings; (3) the # pairs column presents the number of pairs
of A-indiscernibility classes necessary to be discerned; (4) the # elts. column presents
the number of attribute subsets remaining in a discernibility table after applying the
absorption law; (5-7) the remaining columns contain the number, average cardinality, and
detailed list of attribute subsets found as prime implicants for the corresponding boolean
discernibility functions.



7 Conclusions 

We provide a unified methodology for searching for approximate reducts
corresponding to various ways of expressing inexact dependencies in inconsistent
decision tables. In particular, we focus on rough membership reducts, which
preserve conditions→decision frequencies approximately, in terms of the choice
of a tolerance threshold and a function measuring distances between frequency
distributions.

The presented results generalize the well-known relationship between rough set
reducts and boolean prime implicants onto the whole class of considered
approximations. This makes it possible to use the well-known algorithmic framework
for searching for minimal decision reducts (cf. [6]) for approximate μ-decision
reduct optimization.

It is also worth emphasizing that the introduced tools set up a kind of rough
set bridge between the approximate boolean calculus and approximate
probabilistic independence models. This fact relates our study to a wide range of
applications dedicated, in general, to the efficient extraction and representation
of data based knowledge.

Finally, the described example illustrates how one can influence the efficiency of the
process of attribute reduction under inconsistency by tuning the approximation
parameters. Still, further work is needed to gain more experience concerning
the choice of these parameters for the purpose of obtaining optimal models of
new case classification and data representation.



Acknowledgements 

This work was supported by the grants of Polish National Committee for Scientific
Research (KBN) No. 8T11C02319 and 8niC'02519.

References 

1. Brown, F.M.: Boolean Reasoning. Kluwer Academic Publishers, Dordrecht (1990).

2. Pawlak, Z.: Rough sets - Theoretical aspects of reasoning about data. Kluwer
Academic Publishers, Dordrecht (1991).

3. Pawlak, Z.: Decision rules, Bayes' rule and rough sets. In: N. Zhong, A. Skowron
and S. Ohsuga (eds.), Proc. of the Seventh International Workshop RSFDGrC'99,
Yamaguchi, Japan, LNAI 1711 (1999) pp. 1-9.

4. Pawlak, Z., Skowron, A.: Rough membership functions. In: R.R. Yager,
M. Fedrizzi, and J. Kacprzyk (eds.), Advances in the Dempster Shafer Theory of
Evidence, John Wiley & Sons, Inc., New York, Chichester, Brisbane, Toronto,
Singapore (1994) pp. 251-271.

5. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann (1988).

6. Polkowski, L., Skowron, A. (eds.): Rough Sets in Knowledge Discovery, parts 1, 2,
Heidelberg, Physica-Verlag (1998) pp. 321-365.

7. Skowron, A.: Boolean reasoning for decision rules generation. In: Proc. of the
Seventh International Symposium ISMIS'93, Trondheim, Norway, 1993;
J. Komorowski, Z. Ras (eds.), LNAI 689, Springer-Verlag (1993) pp. 295-305.

8. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information
systems. In: R. Słowiński (ed.), Intelligent Decision Support. Handbook of
Applications and Advances of the Rough Set Theory, Kluwer Academic Publishers,
Dordrecht (1992) pp. 311-362.

9. Slezak, D.: Various approaches to reasoning with frequency-based decision reducts:
a survey. In: L. Polkowski, S. Tsumoto, T.Y. Lin (eds.), Rough Sets in Soft
Computing and Knowledge Discovery: New Developments, Physica-Verlag /
Springer-Verlag (2000).

10. Ziarko, W.: Decision Making with Probabilistic Decision Tables. In: N. Zhong,
A. Skowron and S. Ohsuga (eds.), Proc. of the Seventh International Workshop
RSFDGrC'99, Yamaguchi, Japan, LNAI 1711 (1999) pp. 463-471.




Scalable Feature Selection Using Rough Set Theory

Moussa Boussouf, Mohamed Quafafou

IRIN, Universite de Nantes, 2 rue de la Houssiniere,
BP 92208 - 44322, Nantes Cedex 03, France.

{boussouf, quafafou}@irin.univ-nantes.fr



Abstract. In this paper, we address the problem of feature subset selection
using rough set theory. We propose a scalable algorithm to find a
set of reducts based on the discernibility function, which is an alternative
to the exhaustive approach. Our study shows that our algorithm
improves on the classical one from three points of view: computation time,
reduct size and the accuracy of the induced model.



1 Introduction 

Irrelevant and redundant features may reduce predictive accuracy, degrade
the learner's speed (due to the high dimensionality) and reduce the
comprehensibility of the induced model. Thus, pruning these features or selecting
relevant ones becomes necessary.

In the rough set theory [7] [8], the process of feature subset selection is viewed
as (relative) reducts computation. In this context, different works have been
developed to deal with the problem of feature subset selection. Modrzejewski [5]
proposes a heuristic feature selector algorithm, called PRESET. It consists in
ordering attributes to obtain an optimal preset decision tree. Kohavi and Frasca
in [4] have shown that, in some situations, the useful subset does not necessarily
contain all the features in the core and may be different from a reduct. Using
α-RST [9], we have proposed in [10] an algorithm based on the wrapper approach
to solve this problem. We have shown that we can obtain smaller reducts with
higher accuracy than those obtained by classic rough set concepts. Skowron and
Rauszer [13] mentioned that the problem of computing minimal reducts is
NP-hard. They proposed an algorithm to find a set of reducts which have the same
characteristics as the original data. This algorithm needs to compute discernibility
between all pairs of objects of the training set, i.e., it performs $N(N-1)/2$
comparisons, where $N$ is the number of objects. Consequently, the reducts
computation can be time consuming when the decision table has too many objects
(or attributes).

In this paper, we study the algorithm for finding reducts based on the
discernibility matrix [13]. We propose a new algorithm based on computing a minimal
discernibility list, which is an approximate discernibility matrix. We show that
our algorithm covers the search space very early with respect to the discernibility
function. It produces smaller reducts and more accurate models, and its computation
time is considerably lower than that of the classical algorithm.



W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 131-138, 2001. 
© Springer- Verlag Berlin Heidelberg 2001 
2 Finding Reducts in Rough Set Theory

In the rough set theory, an information system has a data table form. Formally,
an information system $S$ is a 4-tuple $S = (U, Q, V, f)$, where $U$ is a finite set
of objects, $Q$ is a finite set of attributes, $V = \bigcup_{q \in Q} V_q$, where $V_q$ is the domain
of attribute $q$, and $f$ is an information function assigning a value to every object
and every attribute, i.e., $f : U \times Q \to V$, such that for every $x \in U$ and for every
$q \in Q$, $f(x, q) \in V_q$.

Definition 1. Discernibility Relation: Let $x, y \in U$ be two distinct objects.
The discernibility relation, denoted by DIS, assigns pairs of objects to a subset of
$Q$, i.e., $DIS : U \times U \to \wp(Q)$, where $\wp(Q)$ is the set of all subsets of $Q$. The
discernibility between $x$ and $y$ is defined as follows:

$$DIS(x, y) = \{q \in Q \mid f(x, q) \neq f(y, q)\}$$

For each pair of objects, the discernibility relation assigns a set of attributes
which discern these objects, i.e., it maps a pair of objects to a subset of
$Q$. Consequently, we associate a discernibility matrix, denoted DM, with each
information system, where $DIS(i, j)$ is an element of the matrix DM which
contains the attributes distinguishing an object $i$ from another object $j$.
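
The discernibility relation and the matrix built from it can be sketched in Python as below. The three-attribute toy objects and the function names are hypothetical; the sketch only shows that one set of attributes is stored per unordered pair of objects, so N objects give N(N-1)/2 entries.

```python
from itertools import combinations

def dis(x, y):
    """DIS(x, y): the attributes on which objects x and y take different values."""
    return {q for q in x if x[q] != y[q]}

def discernibility_matrix(objects):
    """One entry per unordered pair (i, j) with i < j."""
    return {(i, j): dis(objects[i], objects[j])
            for i, j in combinations(range(len(objects)), 2)}

# Hypothetical information system with attributes q1, q2, q3
objects = [
    {"q1": 0, "q2": 1, "q3": 0},
    {"q1": 0, "q2": 0, "q3": 0},
    {"q1": 1, "q2": 1, "q3": 0},
    {"q1": 1, "q2": 0, "q3": 1},
]

dm = discernibility_matrix(objects)
print(len(dm))     # 6 entries for 4 objects
print(dm[(0, 3)])  # the set {'q1', 'q2', 'q3'} (element order may vary)
```

The quadratic growth of the number of entries is what makes this construction costly on large tables.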

Let $R_1$ and $R_2$ be two subsets of attributes such that $R_1 \subseteq R_2$. We say that $R_1$ is
a reduct of $R_2$ if, and only if, $R_1$ has exactly the same characteristics (discrimination
power, approximation space, etc.) as $R_2$ (see [7] [8] [13] for more details).

Skowron and Rauszer [13] have proposed an algorithm for finding reducts which is
based on the discernibility matrix. They have proved that the problem of computing
reducts in the rough set model is transformable to the problem of finding prime
implicants of a monotonic boolean function called the discernibility function. Their
process of calculating reducts can be summarized in two steps:

- Step 1:

1.1. Computing the discernibility matrix DM: each element of DM contains
a set of attributes which discern a pair of objects;

1.2. Defining the discernibility function: this process leads to a conjunctive
form of the discernibility function, denoted DF. In fact, each element of DM
produces a term represented by a disjunctive form, and the conjunction of
these terms defines DF;

- Step 2: Computing reducts: this process consists of transforming the
discernibility function from a conjunctive normal form to a disjunctive one.
In fact, this function is reduced basically by applying absorption rules. This
construction produces a reduced disjunctive form. Hence each term of this
disjunctive form represents a reduct.

Example 1: Let Q = {1, 2, 3, 4, 5} be the set of attributes of an information
system with 5 attributes. If DM = {{1}, {2, 3}, {3, 5}, {1, 2}, {3, 4, 5}, {1, 2}},
then the discernibility function is $DF = 1 \wedge (2 \vee 3) \wedge (3 \vee 5)$. The disjunctive form
produces two reducts, which are {1, 2, 5} and {1, 3}.
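
The two steps can be checked on Example 1 with a small sketch. It is a straightforward expansion with absorption, not the authors' implementation; the function names are hypothetical, and the expansion can be exponential in the worst case.

```python
def absorb(sets):
    """Keep only the minimal sets (absorption law: a set absorbs its supersets)."""
    sets = set(sets)
    return {s for s in sets if not any(t < s for t in sets)}


def reducts_from_clauses(clauses):
    """Turn a conjunction of disjunctive clauses into a reduced disjunctive form.

    Each clause is a set of attributes (one element of DM); each returned
    frozenset is one term of the disjunctive form, i.e. one reduct.
    """
    clauses = absorb(frozenset(c) for c in clauses if c)
    terms = {frozenset()}
    for clause in clauses:
        # Distribute the clause over the current terms, then absorb.
        terms = absorb(term | {attr} for term in terms for attr in clause)
    return terms


dm = [{1}, {2, 3}, {3, 5}, {1, 2}, {3, 4, 5}, {1, 2}]
reducts = reducts_from_clauses(dm)
```

After absorption only the clauses {1}, {2, 3} and {3, 5} remain, which is the discernibility function given in the example.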
Extensive research has been carried out to find more refined reducts that
deal with real-world data and to construct accurate induced models using a
machine learning algorithm. Consequently, different extended reduct definitions
have been introduced: (1) the β-reduct defined in the Variable Precision Rough Sets Model
[14], (2) the α-reduct formalized in the context of a generalized rough set theory,
called α-Rough Set Theory [10][11], and (3) dynamic reducts [1].

The crucial problem in reducts research is that the most widely used algorithms
are based on the discernibility matrix and on the work of Skowron and
Rauszer [13]. However, the calculation of DM is achieved by comparing all pairs
of objects, i.e., it needs $N(N-1)/2$ comparisons for N objects. The time complexity of the second
step is bounded by p(n), where n is the length of the standard code of the discernibility
function, and p is a polynomial. So, the complexity of the reducts-finding process is
$O(N^2)$. This is prohibitive when we process very large datasets. Using the same
algorithm, Bazan et al. [1] search for dynamic reducts, which are in some sense the
most stable reducts of a given decision table, i.e., they are the most frequently
appearing reducts in subtables created by random samples of a given decision
table. To achieve this process, they must perform $N_i(N_i-1)/2$ comparisons for each
considered sample. The total number of comparisons is equal to $\sum_i N_i(N_i-1)/2$,
where $N_i$ is the size of the i-th sample. There are several problems with
this approach: first, Kohavi and Frasca [4] have shown that, in some
situations, the useful subset does not necessarily contain all the features in the
core and may be different from a reduct; second, due to the high complexity of
the algorithm used, applying it many times aggravates the problem,
especially when each sample is large.

Since the reducts depend on the discernibility function of each sample, we say
that the stability of reducts depends on the stability of the discernibility function (step
1.2). Our approach to this problem is different: instead of searching for
stable reducts downstream, we propose to stabilize the discernibility
function upstream.



3 Fast Scalable Reducts 

We introduced in the previous section the discernibility relation, which
assigns pairs of objects to a subset of Q. We denote by $\mathcal{P}(Q)$ the set of all subsets
of Q, which can be organized as a lattice according to the inclusion operator
($\subseteq$). Each node of the lattice represents a subset of attributes discerning one
pair (or more) of objects. So, each element of DM has a corresponding node in
the lattice. But, to define the discernibility function, only minimal elements are
needed, i.e., big elements are absorbed by the small ones. Consequently, during
step 1 of the algorithm, many comparisons are performed without changing
this function. In this context, the problem of finding reducts is viewed as a
search for a minimal bound in the lattice. This bound is represented by the Minimal
Discernibility List (MDL) and corresponds to an approximation of the
reduced conjunctive form of the discernibility function. This approximation is
computed by an incremental and random process following two steps:
- Step 1: Calculate incrementally the minimal discernibility list, MDL. We
randomly select two objects and determine the subset of attributes
P discerning them. The MDL is then updated as follows: if P is
minimal, i.e., no element of MDL is included in P, then add P to MDL and
remove all $P' \in MDL$ such that $P \subset P'$. This iterative process terminates
when the probability of modifying the MDL becomes small (see Section 3.1);

- Step 2: Calculate reducts from MDL. This step consists in formulating DF
from MDL, then transforming the conjunctive form into a reduced disjunctive
one. Each final term represents a reduct.
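
The update rule of Step 1 can be sketched as follows. The function name `update_mdl` is hypothetical, and skipping an empty subset P (two objects that do not differ) is an assumption of this sketch rather than something stated in the text.

```python
def update_mdl(mdl, p):
    """Try to insert attribute subset p into the minimal discernibility list.

    mdl is a set of frozensets and is modified in place.
    Returns True if mdl was modified, False if p was already covered.
    """
    p = frozenset(p)
    if not p or any(e <= p for e in mdl):
        return False  # an existing element is included in p, so p is absorbed
    # p is minimal: drop every element that p now absorbs, then add p.
    mdl.difference_update({e for e in mdl if p < e})
    mdl.add(p)
    return True


mdl = set()
pairs = [{1, 2}, {3, 4, 5}, {1}, {2, 3}, {3, 5}, {1, 2}]
changes = [update_mdl(mdl, p) for p in pairs]
```

Fed with the elements of DM from Example 1 in this order, the list ends up holding {1}, {2, 3} and {3, 5}, and the last subset {1, 2} leaves it unchanged.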



3.1 Probability Evolution 

Let Q be the set of M attributes and $\mathcal{P}(Q)$ the space of all possible subsets of Q, which
may be represented by a lattice. The cardinality of $\mathcal{P}(Q)$ is $2^M - 1$ (without
considering the empty subset).

Let P be a subset of Q. The number of supersets or coversets of P (including P itself),
denoted $\mathcal{F}(P)$, is equal to $2^{M-|P|}$, where |P| denotes the cardinality of P.

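Let $MDL = \{P_1, P_2, \ldots, P_K\}$ contain K subsets of Q. The total number
of supersets which have no effect on MDL, i.e., which are covered by at least one element
of MDL, is calculated by inclusion-exclusion as follows:

$$
\begin{aligned}
\mathcal{F}(MDL) ={} & (-1)^{1-1} \sum_{i_1=1}^{K} \mathcal{F}(P_{i_1}) \\
& + (-1)^{2-1} \sum_{i_1=1}^{K} \sum_{i_2 > i_1} \mathcal{F}(P_{i_1} \cup P_{i_2}) \\
& + \ldots \\
& + (-1)^{j-1} \sum_{i_1 < i_2 < \ldots < i_j} \mathcal{F}(P_{i_1} \cup P_{i_2} \cup \ldots \cup P_{i_j}) \\
& + \ldots \\
& + (-1)^{K-1} \mathcal{F}(P_1 \cup P_2 \cup \ldots \cup P_K)
\end{aligned}
$$

such that $\mathcal{F}(P_{i_1} \cup P_{i_2} \cup \ldots \cup P_{i_j})$ represents the number of supersets
of one possible union of j elements of MDL.

For instance, if $MDL = \{P_1, P_2\}$, then $\mathcal{F}(MDL) = \mathcal{F}(P_1) + \mathcal{F}(P_2) - \mathcal{F}(P_1 \cup P_2)$.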

The probability of improving the minimal discernibility list, denoted $V(MDL)$,
is calculated as follows:

$$V(MDL) = 1 - \frac{\mathcal{F}(MDL)}{|\mathcal{P}(Q)|} = 1 - \frac{\mathcal{F}(MDL)}{2^M - 1}$$



Example 2: Let Q = {1, 2, 3, 4, 5} be the set of attributes of an information
system with 5 attributes. Supposing that the current MDL = {{1}, {2, 3}, {3, 5}},
the reader can check that the probability of improving MDL equals 1 - 22/31 = 0.29.
This means that, for the remaining comparisons, there is a 71% chance that two
objects are discernible by a subset which is covered by an element of MDL.
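
The figures of Example 2 can be reproduced with a short sketch of the inclusion-exclusion count, cross-checked against a brute-force enumeration of all non-empty subsets. The function names are hypothetical; as the text notes later, the inclusion-exclusion loop visits every non-empty group of MDL elements and is therefore exponential in the size of MDL.

```python
from itertools import combinations


def covered_count(mdl, m):
    """Count the subsets of an m-attribute set that include at least one
    element of mdl, by inclusion-exclusion over unions of mdl elements."""
    mdl = [frozenset(p) for p in mdl]
    total = 0
    for j in range(1, len(mdl) + 1):
        sign = (-1) ** (j - 1)
        for group in combinations(mdl, j):
            union = frozenset().union(*group)
            total += sign * 2 ** (m - len(union))  # supersets of the union
    return total


def covered_count_brute_force(mdl, attributes):
    """Same count obtained by enumerating every non-empty subset."""
    attributes = sorted(attributes)
    count = 0
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            if any(set(p) <= set(subset) for p in mdl):
                count += 1
    return count


def improvement_probability(mdl, m):
    """Probability that a random non-empty subset is not covered by mdl."""
    return 1 - covered_count(mdl, m) / (2 ** m - 1)


mdl = [{1}, {2, 3}, {3, 5}]
covered = covered_count(mdl, 5)
probability = improvement_probability(mdl, 5)
```

Here the three single terms contribute 16 + 8 + 8, the three pairwise unions subtract 4 + 4 + 4, and the full union adds 2, which gives the 22 covered subsets used in the example.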
Figure 1 shows the probability of improving MDL with respect to the number of
comparisons for the Pima, Anntyroid and Shuttle datasets (see their characteristics
in Table 1). For each dataset and at each iteration, two objects $O_1$ and $O_2$
are randomly selected and $DIS(O_1, O_2)$ is calculated. If there is no element in
MDL which covers $DIS(O_1, O_2)$, then MDL is improved and $V(MDL)$ is then
calculated. After performing N comparisons (where N is the dataset size),
the mean probability of improving MDL over 10 executions equals 0.029, 0.038 and
0.032 for Pima, Anntyroid and Shuttle respectively.






Fig. 1. Probability of MDL improvement versus percentage of comparisons performed: (a) Pima, (b) Anntyroid, (c) Shuttle.



If we consider that this probability is smaller than a given threshold, the
MDL represents a good approximation of the discernibility function. Thus, only
N comparisons are performed instead of the $N(N-1)/2$ comparisons of the algorithm
based on computing the full discernibility matrix. Consequently, only a few
comparisons are performed to compute MDL: the percentages of comparisons
performed for the three datasets Pima, Anntyroid and Shuttle are respectively
0.261%, 0.053% and 0.0046%. We remark that the larger the dataset, the
smaller the percentage of comparisons performed.
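
These percentages follow directly from the ratio of N to N(N-1)/2, as the following arithmetic sketch shows. It uses the training-set sizes listed in Table 1; the function name is hypothetical.

```python
def comparison_percentage(n):
    """Percentage of the n(n-1)/2 pairwise comparisons represented by n comparisons."""
    return 100 * n / (n * (n - 1) / 2)


sizes = {"Pima": 768, "Anntyroid": 3772, "Shuttle": 43500}
percentages = {name: comparison_percentage(n) for name, n in sizes.items()}
```

The ratio simplifies to 200/(N-1), so the percentage shrinks roughly in inverse proportion to the dataset size.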

3.2 FSR Algorithm 

Our approach is based on random comparisons between objects of the information
system. At each iteration we check the possibility of MDL improvement.
The process is stopped when a given number of comparisons is reached. The
reducts are then computed from the resulting MDL. Of course, the best solution
would be to stop when the probability of improving MDL falls
below a given threshold. Unfortunately, the cost of the probability computation
is very high when the MDL size is very large; the complexity of the probability
computation is $O(2^{|MDL|})$, where |MDL| represents the MDL cardinality.

The first step of the FSR algorithm (Figure 2) consists in computing the minimal
discernibility list. After n = N comparisons, we consider that we have a good approximation
of the discernibility function. Consequently, the complexity of MDL computation is
$O(N)$. The second step consists in computing the approximate reducts from
MDL. The cost of this step depends on the MDL size. The time complexity of
Input   IS : Information System of N Examples; n : Comparisons Number;
        D : Condition Attributes; C : Class;  /* Q = D ∪ C */

Output  FSR : Fast Scalable Reducts;

STEP 1: for Comparisons := 1 to n
          Random(i, j, N);          /* return two objects: 1 ≤ i < j ≤ N */
          R := DIS(IS[i], IS[j]);   /* the set of features which discern IS[i] and IS[j] */
          if (C ∈ R) and (∄ E ∈ MDL | E ⊆ R)
            add(R - {C}, MDL);      /* Improve MDL */
          endif
        endfor                      /* end construction of minimal MDL */

STEP 2: FSR := Reducts(MDL);        /* Fast Scalable Reducts computation */

Fig. 2. FSR: Fast Scalable Reducts algorithm.



reducts construction (by MDL transformation) is bounded by p(n), where n is
the length of the standard code of the discernibility function, and p is a polynomial.
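
A minimal sketch of the first step of FSR on a decision table is given below. It assumes that rows are dictionaries and that the class is stored under a key passed as `decision`. The function name, the toy table, the seed handling and the choice to skip pairs with identical condition attributes are assumptions of this sketch, not details given in the paper.

```python
import random


def fsr_mdl(table, decision, n_comparisons, seed=0):
    """Build an approximate minimal discernibility list from random comparisons."""
    rng = random.Random(seed)
    mdl = set()
    for _ in range(n_comparisons):
        i, j = rng.sample(range(len(table)), 2)  # two distinct objects
        r = {q for q in table[i] if table[i][q] != table[j][q]}
        if decision not in r:
            continue  # same class: this pair imposes no constraint
        r = frozenset(r - {decision})
        if not r or any(e <= r for e in mdl):
            continue  # inconsistent pair, or already covered by MDL
        mdl = {e for e in mdl if not r < e} | {r}  # absorb supersets, add r
    return mdl


# Hypothetical toy decision table in which the class equals attribute "b".
table = [
    {"a": 0, "b": 0, "c": 0, "class": 0},
    {"a": 0, "b": 1, "c": 0, "class": 1},
    {"a": 1, "b": 0, "c": 1, "class": 0},
    {"a": 1, "b": 1, "c": 1, "class": 1},
]
mdl = fsr_mdl(table, "class", n_comparisons=50, seed=0)
```

Every pair of objects from different classes differs on "b", so each element kept in the list contains that attribute; the reducts would then be obtained from the list as in step 2.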

4 Experimental Results 

In order to evaluate candidate scalable feature subsets generated by the FSR
algorithm, we ran experiments on 18 real-world datasets taken from the UCI Irvine
repository; their characteristics are summarized in Table 1. Original datasets are
transformed using the Fayyad and Irani discretization method [3]. Our hybrid
approach [2] is used to evaluate and select the best reduct. To achieve this process,
we have used C4.5 [12] as an inducer algorithm and the Liu & Setiono filter [6] (the
allowable inconsistency rate is equal to 3% compared with the best reduct
according to the filter). We stop the MDL computation when the number of
comparisons reaches the dataset size. The random seed is the same for all datasets.

Our experiments presented in Table 2 show that the classical reducts and 
those generated with FSR algorithm generally improve the accuracy of C4.5 



Table 1. Datasets considered: size of train sets, test protocol, number of attributes,
number of classes, and percentage of numeric attributes.

Datasets      Train   Test    Att.  Cla.  Num%
Iris            150   5cv       4     3    100
Wine            178   5cv      13     3    100
Glass           214   5cv       9     6    100
Heart           270   5cv      13     2     46
Ecoli           336   5cv       7     8    100
Liver           345   5cv       6     2    100
Breast WD       569   5cv      30     2    100
Australian      690   5cv      14     2     43
Credit          690   5cv      15     2    210
Pima            768   5cv       8     2    100
Vehicle         846   5cv      18     4    100
German         1000   5cv      20     2     35
Segment        2310   5cv      19     7     95
Abalone        3133   1044      8     3     88
Anntyroid      3772   3428     21     3     29
Pendigits      7494   3498     16    10    100
Adult         32561   16281    14     2     43
Shuttle       43500   14500     9     7    100
using original datasets. The FSR algorithm improves the classical reducts
algorithm from three points of view:



Table 2. C4.5 accuracy with original datasets, and the results of the classical and FSR
algorithms: C4.5 accuracy, reducts calculation time (in seconds) and the best reduct size.

                          Classic                    FSR
Datasets     C4.5 Org.    C4.5     Time     Size     C4.5    Time   Size
Iris           92.00      92.00     0.03      4      92.00   0.00     2
Wine           80.12      93.28     0.10      7      95.46   0.03     3
Glass          63.84      63.84     0.09      9      65.38   0.00     4
Heart          74.08      77.76     0.18     10      83.72   0.01     6
Ecoli          81.56      81.56     0.19      7      81.56   0.00     5
Liver          69.00      69.00     0.15      6      68.42   0.00     4
Breast WD      94.38      97.00   273.83     11      96.28   2.73     3
Australian     86.68      86.54     1.13     12      86.10   0.01     7
Credit         82.64      82.80     1.05     11      86.32   0.03     5
Pima           78.40      78.40     1.08      8      78.92   0.00     7
Vehicle        71.28      70.10     2.09     17      71.28   0.02     9
German         71.90      74.40     3.27     13      75.00   0.67     7
Segment        93.64      93.66    15.95     13      93.54   0.01     5
Abalone        63.60      63.60    15.38      8      63.50   0.02     7
Anntyroid      94.00      93.80    52.06     20      92.70   0.02     5
Pendigits      91.70      90.30   161.90     16      89.60   0.08    10
Adult          85.40      85.30  1332.13     13      83.50   0.09     9
Shuttle        99.80      99.80  1861.43      9      99.70   0.09     6
MEAN           81.89      82.95   193.58  10.66      83.48   0.21  5.78



1. Reducts size: The size of reducts generated by the FSR algorithm is always lower
than the size of reducts produced by the classical (exhaustive) algorithm. The
mean size of reducts is 5.78 for the FSR algorithm, whereas it equals 10.66 for the
classical one. The most important result is obtained with the Breast dataset: among
30 attributes, the FSR algorithm selects only 3 attributes and the accuracy is
improved compared with the original data.

2. Accuracy: The accuracy of reducts produced by the FSR algorithm generally
improves both on the accuracy using original datasets and on the accuracy of reducts
produced by the classical algorithm; the differences between average accuracies equal
+1.59% and +0.53% respectively. Comparing the accuracy of C4.5 using the best
reducts generated by the FSR algorithm with the accuracy on the original datasets,
the worst result is obtained with the Pendigits dataset, where accuracy falls by 2.1%, whereas
the best one is obtained with the Wine dataset, where accuracy improves by 15.34%.

3. Time: The most interesting result is the time of calculating reducts with the
FSR algorithm. The average time is 0.21s for the FSR algorithm and 193.58s for
the classical algorithm, i.e., a speed-up factor of 921.81. The
time of the FSR algorithm is, in most cases, lower than 1s, except for the Breast dataset
(where MDL contains many elements, so the second step of the FSR
algorithm is quite slow). The best result was obtained with the largest dataset,
i.e., Shuttle: FSR stopped after 0.09s, whereas the classical approach
finished after 1861.43s, which represents a factor of 20682.

5 Conclusion 

In this paper we have studied the problem of feature selection using rough set
theory. We have proposed an algorithm to calculate reducts based on finding a
minimal discernibility list from an approximate discernibility matrix. In this
context, we have shown that the problem of reducts computation using the
whole discernibility matrix is reducible to a problem of reducts computation based
on quickly finding a minimal discernibility list, which covers a large space of the
discernibility matrix. We advocate the use of the FSR algorithm instead of the classical
one, which is based on computing full comparisons, for three main reasons: (1)
the reducts size is lower; (2) the accuracy is generally higher; (3) the FSR time
is much lower.
Abstract. An algorithm is considered which for a given decision table
constructs a decision tree with minimal number of nodes. The class of
all information systems (finite and infinite) is described for which this
algorithm has polynomial time complexity depending on the number of
columns (attributes) in decision tables.



1 Introduction 

Decision trees are widely used in different areas of applications. The problem of
constructing an optimal decision tree is known to be complicated. In this paper an
algorithm is considered which, for a given decision table, constructs a decision
tree with minimal number of nodes. The running time of the algorithm is bounded
above by a polynomial in the number of columns and rows of the decision table
and in the number of so-called nonterminal separable sub-tables of the table.

Decision tables over an arbitrary (finite or infinite) information system
are also considered, and all information systems are described for which the number
of rows and the number of nonterminal separable sub-tables are bounded above
by a polynomial in the number of columns (attributes) in the table.

The idea of the algorithm is close to so-called descriptive methods of
optimization [3, 4, 5, 6]. The obtained results may be useful in test theory, the
groundwork for which was laid by [1], in rough set theory created in [9, 10], and
in their applications.

The algorithm can be generalized to such complexity measures
as the depth [8] and the average depth [2] of the decision tree. Similar results
were announced in [7]. The considered algorithm may also be generalized to the
case when each column is assigned a weight.

2 Basic Notions 

A decision table is a rectangular table filled by numbers from $E_k = \{0, \ldots, k-1\}$,
$k \ge 2$, in which rows are pairwise different. Let T be a decision table containing
n columns and m rows, which are labeled with numbers 1, ..., n and 1, ..., m
respectively. Denote by dim T and by N(T) the number of columns and rows
in T respectively. For i = 1, ..., m, let the i-th row of the table T be assigned a
natural number $\nu_i$, where $0 \le \nu_i \le k^n - 1$. The m-tuple $\nu = (\nu_1, \ldots, \nu_m)$ provides
the division of the set of rows of the table T into classes, in which all rows are
labeled with the same number. A two-person game can be associated with the
table T. The first player chooses a row in the table, and the second one must
determine the number with which the chosen row is labeled. For this purpose he can ask
the first player questions: he can select a column in the table and ask what
number is at the intersection of this column and the chosen row.

Each strategy of the second player can be represented as a decision tree, that
is, a marked finite oriented tree with root in which each nonterminal node is
assigned a number of a column; each edge is assigned a number from $E_k$; edges
starting in a nonterminal node are assigned pairwise different numbers; each
terminal node is assigned a number from the set $\{\nu_1, \ldots, \nu_m\}$. A decision tree
that represents a strategy of the second player will be called correct.

As a complexity measure the number of nodes in the decision tree is used. A
correct decision tree with minimal number of nodes will be called optimal.

Let $i_1, \ldots, i_t \in \{1, \ldots, n\}$ and $\delta_1, \ldots, \delta_t \in E_k$. Denote by $T(i_1,\delta_1) \ldots (i_t,\delta_t)$
the sub-table of the table T containing only those rows which, at the intersections
with columns labeled by $i_1, \ldots, i_t$, have numbers $\delta_1, \ldots, \delta_t$ respectively. If the
considered sub-table differs from T and has at least one row then it will be called a
separable sub-table of the table T. A decision table will be called terminal if
all rows in the table are labeled with the same number. For a nonterminal table
T we denote by S(T) the set of all nonterminal separable sub-tables of the table
T including the table T.

3 Algorithm for Constructing of Optimal Decision Trees 

In this section the algorithm $\mathcal{A}$ is considered which for a given decision table T
constructs an optimal decision tree.

Step 0. If T is a terminal table and all rows of the table T are labeled with
the number j, then the result of the algorithm's work is the decision tree consisting
of one node, which is assigned the number j. Otherwise construct the set S(T)
and pass to the first step.

Suppose $t \ge 0$ steps have been carried out.

Step (t+1). If the table T has been assigned a decision tree then this decision
tree is the result of the work of the algorithm $\mathcal{A}$. Otherwise choose in the set S(T) a
table D satisfying the following conditions:

a) the table D has not been assigned a decision tree yet;

b) for each separable sub-table $D_1$ of the table D either the table $D_1$ has been
assigned a decision tree, or $D_1$ is a terminal table.

For each terminal separable sub-table $D_1$ of the table D, assign to the table
$D_1$ the decision tree $\Gamma(D_1)$. Let all rows in the table $D_1$ be labeled with the
number j. Then the decision tree $\Gamma(D_1)$ consists of one node, which is assigned
the number j.
For $i \in \{1, \ldots, \dim D\}$ denote by $E(D,i)$ the set of numbers contained in the
i-th column of the table D, and denote $I(D) = \{i : i \in \{1, \ldots, \dim D\}, |E(D,i)| \ge 2\}$.
For every $i \in I(D)$ and each $\delta \in E(D,i)$ denote by $\Gamma(i,\delta)$ the decision tree
assigned to the table $D(i,\delta)$. Let $i \in I(D)$ and $E(D,i) = \{\delta_1, \ldots, \delta_r\}$. Define
a decision tree $\Gamma_i$. The root of $\Gamma_i$ is assigned the number i. The root is the initial
node of exactly r edges $d_1, \ldots, d_r$, which are labeled by the numbers $\delta_1, \ldots, \delta_r$
respectively. The roots of the decision trees $\Gamma(i,\delta_1), \ldots, \Gamma(i,\delta_r)$ are the terminal
nodes of the edges $d_1, \ldots, d_r$ respectively. Assign to the table D one of the trees
$\Gamma_i$, $i \in I(D)$, having minimal number of nodes and pass to the next step.

It is not difficult to prove the following statement. 

Theorem 1. For any decision table T filled by numbers from $E_k$, the algorithm
$\mathcal{A}$ constructs an optimal decision tree, and performs exactly $|S(T)| + 1$ steps. The
time of the work of the algorithm $\mathcal{A}$ is bounded below by $c|S(T)|$, where c is a positive
constant, and bounded above by a polynomial in N(T), dim T and $|S(T)|$.
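
The dynamic-programming idea behind the algorithm can be sketched as follows. This sketch uses memoized recursion over sub-tables rather than the step-by-step bottom-up schedule described above, and it returns only the minimal number of nodes, not the tree itself. The function name and the toy tables are hypothetical.

```python
from functools import lru_cache


def min_nodes(rows, labels):
    """Minimal number of nodes of a correct decision tree for a decision table.

    rows: list of pairwise different tuples; labels: one label per row.
    """
    n_cols = len(rows[0])

    @lru_cache(maxsize=None)
    def solve(sub):  # sub: frozenset of row indices, i.e. one sub-table
        if len({labels[r] for r in sub}) == 1:
            return 1  # terminal table: a single labeled node
        best = None
        for i in range(n_cols):
            values = {rows[r][i] for r in sub}
            if len(values) < 2:
                continue  # column i does not split this sub-table
            size = 1 + sum(
                solve(frozenset(r for r in sub if rows[r][i] == v))
                for v in values
            )
            if best is None or size < best:
                best = size
        return best

    return solve(frozenset(range(len(rows))))


rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
first_column = min_nodes(rows, [0, 0, 1, 1])  # label equals column 1
conjunction = min_nodes(rows, [0, 0, 0, 1])   # label is the AND of both columns
parity = min_nodes(rows, [0, 1, 1, 0])        # label is the XOR of both columns
```

Each distinct sub-table is solved once, so the work grows with the number of nonterminal separable sub-tables, in line with the bounds of Theorem 1.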



4 Decision Tables over Information Systems 

Let A be a nonempty set, F a nonempty set of functions from A to $E_k$, and
$f \not\equiv \mathrm{const}$ for any $f \in F$. Functions from F will be called attributes, and the
pair U = (A, F) will be called a k-valued information system.

For arbitrary attributes $f_1, \ldots, f_n \in F$ and an arbitrary function $\nu : E_k^n \to
\{0, 1, \ldots, k^n - 1\}$ we denote by $T(f_1, \ldots, f_n, \nu)$ the decision table with n columns
which contains the row $(\delta_1, \ldots, \delta_n) \in E_k^n$ iff the system of equations

$$\{f_1(x) = \delta_1, \ldots, f_n(x) = \delta_n\} \qquad (1)$$

is compatible on the set A. This row is assigned the number $\nu(\delta_1, \ldots, \delta_n)$. The
table $T(f_1, \ldots, f_n, \nu)$ will be called a decision table over the information system
U. Denote by $\mathcal{T}(U)$ the set of decision tables over U.

Consider the functions

$$S_U(n) = \max\{|S(T)| : T \in \mathcal{T}(U), \dim T \le n\}$$

and

$$\mathcal{N}_U(n) = \max\{N(T) : T \in \mathcal{T}(U), \dim T \le n\},$$

which characterize the maximal number of nonterminal separable sub-tables
and the maximal number of rows respectively, depending on the number of columns
in decision tables over U.

Using Theorem 1 one can show that for tables over U the time complexity of
the algorithm $\mathcal{A}$ is bounded above by a polynomial in the number of columns if
both the functions $S_U(n)$ and $\mathcal{N}_U(n)$ are bounded above by a polynomial in n,
and the time complexity of the algorithm $\mathcal{A}$ has an exponential lower bound if the
function $S_U(n)$ grows exponentially.

A system of equations of the kind (1) will be called a system of equations over
U. Two systems of equations are called equivalent if they have the same set of
solutions. A compatible system of equations will be called uncancellable if each
of its proper subsystems is not equivalent to the system. Let r be a natural number.
An information system U will be called r-restricted (restricted) if each uncancellable
system of equations over U consists of at most r equations.

Theorem 2. Let U = (A, F) be a k-valued information system. Then the following
statements hold:

a) if U is an r-restricted information system then $S_U(n) \le (nk)^r + 1$ and
$\mathcal{N}_U(n) \le (nk)^r + 1$ for any natural n;

b) if U is not a restricted information system then $S_U(n) \ge 2^n - 1$ for any
natural n.

Proof. a) Let U be an r-restricted information system and $T = T(f_1, \ldots, f_n, \nu) \in
\mathcal{T}(U)$. One can show that both the values $|S(T)|$ and N(T) do not exceed the
number of pairwise nonequivalent compatible subsystems of the system of
equations $\{f_1(x) = 0, \ldots, f_n(x) = 0, \ldots, f_1(x) = k-1, \ldots, f_n(x) = k-1\}$ including
the empty system (the set of solutions of the empty system is equal to A). Each
compatible system of equations over U contains an equivalent subsystem with
at most r equations. Then $|S(T)| \le (\dim T)^r k^r + 1$ and $N(T) \le (\dim T)^r k^r + 1$.
Therefore $S_U(n) \le (nk)^r + 1$ and $\mathcal{N}_U(n) \le (nk)^r + 1$.

b) Let U be not a restricted system and n be a natural number. Then there
exists an uncancellable system of equations over U with at least n equations.
Evidently, each subsystem of this system is uncancellable. Therefore there exists
an uncancellable system over U with n equations. Let it be the system (1), which
will be denoted by W. We prove that every two different subsystems $W_1$ and
$W_2$ of the system W are nonequivalent. Assume the contrary. Then the subsystems
$W \setminus (W_1 \setminus W_2)$ and $W \setminus (W_2 \setminus W_1)$ are equivalent to W and at least one of them
is a proper subsystem of W, which is impossible. Denote $T = T(f_1, \ldots, f_n, \nu)$,
where $\nu$ is the function that puts into correspondence to the n-tuple
$\delta \in E_k^n$ the number for which $\delta$ is a notation in the system with the radix k. Each
proper subsystem $\{f_{i_1}(x) = \delta_{i_1}, \ldots, f_{i_t}(x) = \delta_{i_t}\}$ of the system W corresponds to
the separable sub-table $T(i_1, \delta_{i_1}) \ldots (i_t, \delta_{i_t})$ of the table T. Two different subsystems
being nonequivalent to each other and nonequivalent to the system W, the
sub-tables corresponding to these subsystems are different and nonterminal. Then
$|S(T)| \ge 2^n - 1$. Hence $S_U(n) \ge 2^n - 1$. □

Example 3. Denote by A the set of all points in the plane. Consider an arbitrary
straight line l, which divides the plane into positive and negative open
half-planes. Put into correspondence a function $f : A \to \{0, 1\}$ to the straight line l.
The function f takes the value 1 if a point is situated in the positive half-plane, and
f takes the value 0 if a point is situated in the negative half-plane or on the line l.
Denote by F an infinite set of functions which correspond to some straight lines
in the plane. Consider two cases.

1) Functions from the set F correspond to t infinite classes of parallel straight
lines. One can show that the information system U is 2t-restricted.

2) Functions from the set F correspond to all straight lines in the plane.
Then the information system U is not restricted.
Abstract. Dimensionality is an obstacle for many potentially powerful
machine learning techniques. Widely approved and otherwise elegant
methodologies exhibit relatively high complexity. This limits their
applicability to real world applications. Friedman’s Multivariate Adaptive
Regression Splines (MARS) is a function approximator that produces
continuous models of multi-dimensional functions using recursive
partitioning and multidimensional spline curves that are automatically
adapted to the data. Despite this technique’s many strengths, it, too, suffers
from the dimensionality problem. Each additional dimension of a
hyperplane requires the addition of one dimension to the approximation
model, and an increase in the time and space required to compute and
store the splines. Rough set theory can reduce dataset dimensionality as
a preprocessing step to training a learning system. This paper
investigates the applicability of the Rough Set Attribute Reduction (RSAR)
technique to MARS in an effort to simplify the models produced by the
latter and decrease their complexity. The paper describes the techniques
in question and discusses how RSAR can be integrated with MARS. The
integrated system is tested by modelling the impact of pollution on
communities of several species of river algae. These experimental results help
draw conclusions on the relative success of the integration effort.



1 Introduction 

High dimensionality is an obstacle for many potentially powerful machine learn- 
ing techniques. Widely approved and otherwise elegant methodologies exhibit 
relatively high complexity. This places a ceiling on the applicability of such ap- 
proaches, especially to real world applications, where the exact parameters of 
a relation are not necessarily known, and many more attributes than necessary 
are used to ensure all the necessary information is present. 

Friedman’s Multivariate Adaptive Regression Splines (MARS) [5] is a useful 
function approximator. MARS employs recursive partitioning and spline Basis 
functions to closely approximate the application domain [1]. The partitioning and 
number of Basis functions used are automatically determined by this approach 
based on the training data. Unlike other function approximators, MARS pro- 
duces continuous, differentiable approximations of multidimensional functions, 
due to the use of spline curves. The approach is efficient and adapts itself to 
the domain and training data. However, it is a relatively complex process, and
suffers from the curse of dimensionality [5]. Each dimension of the hyperplane 
requires one dimension for the approximation model, and an increase in the time 
and space required to compute and store the splines. 

Rough set theory [6] is a methodology that can be employed to reduce the 
dimensionality of datasets as a preprocessing step to training a learning system 
on the data. Rough Set Attribute Reduction (RSAR) works by selecting the 
most information-rich attributes in a dataset, without transforming the data, 
all the while attempting to lose no information needed for the classification task 
at hand [2]. The approach is highly efficient, relying on simple set operations, 
which makes it suitable as a preprocessor for more complex techniques. This 
paper investigates the application of RSAR to preprocessing datasets for MARS, 
in an effort to simplify the models produced by the system and decrease their 
complexity. The integrated system is used to build a model of river algae growth 
as influenced by changes in the concentration of several chemicals in the water. 
The success of the application is demonstrated by the reduction in the number 
of measurements required, in tandem with accuracy that matches very closely 
that of the original, unreduced dataset. 

The paper begins by briefly describing the two techniques in question. Issues 
pertaining to how RSAR can be used with MARS are also discussed and the 
integrated system is described. Experiments and their results are then provided 
and discussed, and conclusions are drawn about the overall success of the integ- 
ration effort. 

2 Background 

The fundaments of MARS and RSAR are explained below. The explanations are 
kept brief, as there already exist detailed descriptions of both techniques in the 
literature [5,7]. 



2.1 Multivariate Adaptive Regression Splines 

MARS [5] is a statistical methodology that can be trained to approximate mul- 
tidimensional functions. MARS uses recursive partitioning and spline curves to 
closely approximate the underlying problem domain. The partitioning and num- 
ber of Basis functions used are automatically determined by this approach based 
on the provided training data. 

A spline is a parametric curve defined in terms of control points and a Basis 
function or matrix, that approximates its control points [4]. Although splines do 
not generally interpolate their control points, they can approximate them quite 
closely. The basis function or matrix provides the spline with its continuous 
characteristics. Two- and 3-dimensional splines are used widely in computer 
graphics and typography [1]. 

MARS adapts the general, n-dimensional form of splines for function approx- 
imation. It generates a multi-dimensional spline to approximate the shape of the 
domain’s hyper-plane. Each attribute is recursively split into subregions. This
partitioning is performed if a spline cannot approximate a region within reason- 
able bounds. A tree of spline Basis functions is thus built. This allows MARS 
great flexibility and autonomy in approximating numerous deceptive functions. 
MARS models may be expressed in the following form, known as ANOVA
decomposition [5]:

$$f(x) = a_0 + \sum_{K_m=1} f_i(x_i) + \sum_{K_m=2} f_{ij}(x_i, x_j) + \sum_{K_m=3} f_{ijk}(x_i, x_j, x_k) + \ldots$$

where $a_0$ is the coefficient of the constant Basis function $B_1$ [5], $f_i$ is a univariate
Basis function of $x_i$, $f_{ij}$ is a bivariate Basis function of $x_i$ and $x_j$, and so on.
In this context, $K_m$ is the number of variables a Basis function involves. The
ANOVA decomposition shows how a MARS model is the sum of Basis functions,
each of which expresses a relation between a subset of the variables of the entire
model. As an example, a univariate Basis function $f_i$ is defined as:

Krr, 

^ p 1 gi 

M^i) = ^ where = ^[^Skm ■ {xkm ~ tkm) ■ 

Km=l k = l 

Here, is the coefficient of Basis function which only involves variable Xi. 
Bm is the Basis function in question, involving Km ordinates Xkm of point x 
{1 < k < Km)j Q is the order of the multivariate spline, with > 1; Skm = il; 
and tkm is ordinate m of the control point tm- 

MARS uses recursive partitioning to adjust the Basis functions’ coefficients
($\{a_m\}_1^M$, for each Basis function $B_m$, $1 \le m \le M$), and to partition the
universe of discourse into a set of disjoint regions $\{R_m\}_1^M$. A region $R$ is
split into two subregions if and only if a Basis function cannot be adjusted to fit
the data in $R$ within a predefined margin [5].
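
The product form of a Basis function given above can be illustrated with a short sketch. It is not the MARS fitting procedure, only the evaluation of an already fitted model; the names `basis_function` and `mars_predict` and the tuple layout of a factor are assumptions made for this example.

```python
from typing import Sequence, Tuple

# One factor of a Basis function: (variable index, sign s_km in {+1, -1}, knot t_km).
# This layout is a hypothetical choice for illustration.
Factor = Tuple[int, int, float]


def basis_function(x: Sequence[float], factors: Sequence[Factor], q: int = 1) -> float:
    """Evaluate B_m(x) = prod_k [s_km * (x_km - t_km)]_+^q."""
    value = 1.0
    for var, sign, knot in factors:
        hinge = max(0.0, sign * (x[var] - knot))  # truncated at zero
        value *= hinge ** q
    return value


def mars_predict(x, a0, terms, q=1):
    """Return a_0 plus the sum of a_m * B_m(x) over all (a_m, factors) terms."""
    return a0 + sum(a_m * basis_function(x, factors, q) for a_m, factors in terms)


if __name__ == "__main__":
    point = [3.0, 1.0]
    univariate = [(0, +1, 1.0)]               # involves x_0 only (K_m = 1)
    bivariate = [(0, +1, 1.0), (1, -1, 2.0)]  # involves x_0 and x_1 (K_m = 2)
    print(basis_function(point, univariate))
    print(basis_function(point, bivariate))
    print(mars_predict(point, 0.5, [(2.0, univariate)]))
```

Each additional factor multiplies in one more truncated term, which is why the number of variables a Basis function involves ($K_m$) directly determines its cost.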

Unlike many other function approximators, MARS produces continuous, dif- 
ferentiable approximations of multidimensional functions, thanks to the use of 
splines. MARS is particularly efficient and produces good results. The continuity 
of the resultant approximative models is one of the most desirable results if stat- 
istical analysis is to be performed. However, MARS is relatively complex, and 
suffers from the curse-of-dimensionality problem. Each dimension of the hyper- 
plane requires one dimension for the approximation model, and an increase in 
the time and space required to compute and store the splines. The time required 
to perform predictions increases exponentially with the number of dimensions. 
Further, MARS is very sensitive to outliers. Noise may mar the model by caus- 
ing MARS to generate a much more complex model as it tries to incorporate 
the noisy data into its approximation. A technique that simplified the produced 
models and did away with some of the noise would thus be very desirable. This 
forms the very reason that the Rough Set-based Attribute Reduction technique 
is adopted herein to build an integrated approach to multivariate regression with 
reduced dimensionality.

2.2 The Rough Set Attribute Reduction Method (RSAR)

Rough set theory [6] is a flexible and generic methodology. Among its many uses 
is dimensionality reduction of datasets. RSAR, a technique utilising Rough set 
theory to this end, removes redundant input attributes from datasets of nominal 
values, making sure that very little or no information essential for the task at 
hand is lost. In fact, the technique can improve the information content of data: 
by removing redundancies, learning systems can focus on the useful information, 
perhaps even producing better results than when run on unreduced data. 

RSAR works by maximising a quantity known as degree of dependency. The
degree of dependency $\gamma_X(Y)$ of a set $Y$ of decision attributes on a set of
conditional attributes $X$ provides a measure of how important that set of
conditional attributes is in classifying the dataset examples into the classes in
$Y$. If $\gamma_X(Y) = 0$, then classification $Y$ is independent of the attributes in
$X$, hence the conditional attributes are of no use to this classification. If
$\gamma_X(Y) = 1$, then $Y$ is completely dependent on $X$, hence the attributes are
indispensable. Values $0 < \gamma_X(Y) < 1$ denote partial dependency. To calculate
$\gamma_X(Y)$, it is necessary to define the indiscernibility relation. Given a subset
of the set of attributes, $P \subseteq A$, two objects $x$ and $y$ in a dataset $U$ are
indiscernible with respect to $P$ if and only if $f(x,Q) = f(y,Q)\ \forall Q \subseteq P$
(where $f(a,B)$ is the classification function represented in the dataset, returning
the classification of object $a$ using the conditional attributes contained in the
set $B$). The indiscernibility relation for all $P \subseteq A$ is written as $IND(P)$
(derived fully elsewhere). The upper approximation of a set $Y \subseteq U$, given an
equivalence relation $IND(P)$, is defined as
$\overline{P}Y = \bigcup\{X : X \in U/IND(P),\ X \cap Y \neq \emptyset\}$. The lower
approximation $\underline{P}Y$ is defined analogously, as the union of those classes
$X \in U/IND(P)$ that are wholly contained in $Y$. Assuming equivalence relations
$P$, $Q$ in $U$, it is possible to define the positive region $POS_P(Q)$ as
$POS_P(Q) = \bigcup_{X \in Q} \underline{P}X$. Based on this,

$$\gamma_P(Q) = \frac{\| POS_P(Q) \|}{\| U \|}$$

where $\| Set \|$ is the cardinality of $Set$. The naive version of the RSAR algorithm
evaluates $\gamma_P(Q)$ for all possible subsets of the dataset’s conditional attributes,
stopping when it either reaches 1, or there are no more combinations to in-
vestigate. This is clearly not guaranteed to produce the minimal reduct set of
attributes. Indeed, given the complexity of this operation, it becomes clear that
naive RSAR is intractable for large dimensionalities.
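
As an illustration of the definition above, the degree of dependency can be computed directly from a table of nominal values. The following is a minimal sketch, assuming each example is a Python dictionary keyed by attribute name; the function names `partition` and `dependency` are hypothetical and do not come from the RSAR implementation.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into the equivalence classes of IND(attrs)."""
    classes = defaultdict(set)
    for index, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(index)
    return list(classes.values())


def dependency(rows, cond, dec):
    """Return gamma_P(Q) = |POS_P(Q)| / |U| for conditional attrs P and decision attrs Q."""
    if not rows:
        return 0.0
    decision_classes = partition(rows, dec)
    positive = set()
    for block in partition(rows, cond):
        # A block is in the positive region if it lies wholly inside one decision class.
        if any(block <= d for d in decision_classes):
            positive |= block
    return len(positive) / len(rows)


if __name__ == "__main__":
    data = [
        {"a": 0, "b": 0, "d": 0},
        {"a": 0, "b": 1, "d": 1},
        {"a": 1, "b": 0, "d": 1},
        {"a": 1, "b": 0, "d": 1},
    ]
    print(dependency(data, ["a"], ["d"]))
    print(dependency(data, ["a", "b"], ["d"]))
```

In the toy table, attribute `a` alone classifies only half of the examples unambiguously, whereas `a` and `b` together classify all of them.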

The QuickReduct Algorithm [2] escapes the NP-hard nature of the naive
version by searching the tree of attribute combinations in a best-first manner.
It starts off with an empty subset and adds attributes one by one, each time
selecting the attribute whose addition to the current subset will offer the highest
increase of $\gamma_P(Q)$. The algorithm stops on satisfaction of one of three conditions:
a $\gamma_P(Q)$ of 1 is reached; adding another attribute does not increase $\gamma$; or all
attributes have been added. As a result, the use of QuickReduct makes RSAR
very efficient. It can be implemented in a relatively simple manner, and a number
of software optimisations further reduce its complexity in terms of both space
and time. Additionally, it is evident that RSAR will not settle for
a set of conditional attributes that contains only a large part of the information of
the initial set: it will always attempt to reduce the attribute set without losing
any information significant to the classification at hand.
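
The greedy search described above can be sketched as follows. This is an illustrative reading of the three stopping conditions, not the authors' implementation; the data layout (a list of dictionaries) and the names `_partition`, `_gamma` and `quickreduct` are assumptions.

```python
from collections import defaultdict


def _partition(rows, attrs):
    """Equivalence classes of the indiscernibility relation over attrs."""
    classes = defaultdict(set)
    for index, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(index)
    return list(classes.values())


def _gamma(rows, cond, dec):
    """Degree of dependency of the decision attrs on the conditional attrs."""
    decision_classes = _partition(rows, dec)
    positive = sum(
        len(block)
        for block in _partition(rows, cond)
        if any(block <= d for d in decision_classes)
    )
    return positive / len(rows)


def quickreduct(rows, cond_attrs, dec_attrs):
    """Greedily add the attribute giving the largest increase in dependency."""
    reduct = []
    best = _gamma(rows, reduct, dec_attrs)
    remaining = list(cond_attrs)
    while remaining and best < 1.0:          # stop: gamma of 1, or nothing left
        scores = {a: _gamma(rows, reduct + [a], dec_attrs) for a in remaining}
        candidate = max(remaining, key=scores.get)
        if scores[candidate] <= best:        # stop: no attribute increases gamma
            break
        reduct.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return reduct, best


if __name__ == "__main__":
    data = [
        {"a": 0, "b": 0, "c": 0, "d": 0},
        {"a": 0, "b": 1, "c": 1, "d": 1},
        {"a": 1, "b": 0, "c": 0, "d": 0},
        {"a": 1, "b": 1, "c": 1, "d": 1},
    ]
    print(quickreduct(data, ["a", "b", "c"], ["d"]))
```

Because the search is greedy, each step evaluates at most one candidate per remaining attribute, so the number of dependency evaluations grows quadratically with the number of attributes rather than exponentially.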



3 An Application Domain 

Concern for environmental issues has increased greatly in the last decade [3]. 
The production of waste, toxic and otherwise, from a vast number of different 
manufacturing processes and plants is one of the most important issues. It in- 
fluences directly the future of humanity’s food and water supply. It has become 
clear that even changes in farming and sewage water treatment can affect the 
river, lake and sea ecologies. 

The alga, a ubiquitous single-celled plant, is the most successful coloniser
of any ecology on the planet. There are numerous different species of algae,
and most of them respond rapidly to environmental changes. Wild increases in 
summer algae population in recent years are an indirect result of nearby human 
activities. Booming algae communities are detrimental to water clarity, river life 
and human activities in such areas, since algae growth is associated with toxic 
effects. Biologists are attempting to isolate the chemical parameters that control 
such phenomena. 

The aim of this application of the MARS and RSAR techniques is to predict 
the concentration of seven species of river alga, based on a set of parameters. 
Samples were taken from European rivers over the period of one year, and ana- 
lysed to measure the concentrations of eight chemicals. The pH of the water was 
also measured, as well as the season, river size and flow rate. Population distribu- 
tions for each of the species were determined in the samples. It is relatively easy 
to locate relations between one or two of these quantities and a species of algae, 
but the process of identifying such relations requires well-trained personnel with 
expertise in Chemistry and Biology and involves microscopic examination that is 
difficult to Thus, such a process becomes expensive and slow, given the number 
of quantities involved here. 

The dataset includes 200 instances [3]. The first three attributes of each in- 
stance (season, river size and flow rate) are represented as linguistic variables. 
Chemical concentrations and algae population estimates are represented as con- 
tinuous quantities. The dataset includes a few missing values. To prepare the 
dataset for use by this technique, the linguistic values were mapped to integers. 

RSAR is relatively easy to interface to MARS. The only obstacle is the fact 
that RSAR works better with discrete values. Massaging the data into a suitable 
representation is therefore necessary. The dataset’s first three conditional attrib-
utes are already discrete values. The chemical concentrations exhibit an exponen-
tial distribution (as shown in figure 1). These were transformed by $\lfloor \log(x + 1) \rfloor$,
where $x$ is the attribute value. This was only necessary for RSAR to make its
selection of attributes. The transformed dataset was not fed to MARS. 
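
The transformation applied to the chemical concentrations can be sketched in a few lines. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the function name `discretise` is likewise hypothetical.

```python
import math


def discretise(value: float) -> int:
    """Map a non-negative concentration x to floor(log(x + 1)).

    The natural logarithm is an assumption; the text only says "log".
    """
    if value < 0:
        raise ValueError("concentrations are expected to be non-negative")
    return math.floor(math.log(value + 1.0))


if __name__ == "__main__":
    for concentration in (0.0, 1.0, 2.0, 100.0):
        print(concentration, discretise(concentration))
```

Adding 1 before taking the logarithm keeps a zero concentration at level 0, and the floor collapses each exponentially wider band of concentrations into a single discrete value that RSAR can treat as nominal.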

Fig. 1. Density plots for three of the algae dataset attributes.



The seven decision attributes were converted to a logarithmic scale, then 
quantised into four regions each to signify four exponentially increasing levels 
of abundance. This processing was performed for the sake of both RSAR and 
MARS. Although other preprocessing techniques [7] may also be employed to 
implement this kind of conversion, the use of this particular quantisation is reas- 
onable in a real-world context because of the way the algae population ‘counts’ 
are obtained. It is assumed that the river’s water is perfectly homogeneous and 
that any sample of the water, no matter how small, is statistically representat-
ive. A few drops of each sample are examined visually via microscope and the
number of algae is counted. This allows for human errors in determining the
population, as well as statistical inaccuracies. Quantisation alleviates this prob- 
lem. In addition, if the aim is to predict the behaviour of algae communities, it is 
far more intuitive to provide linguistic estimates of the population like ‘normal’, 
‘lower’ and ‘higher’. 

4 Experimental Results 

This paper claims that the application of RSAR to MARS can reduce the time 
and space required to store MARS models, and that the accuracy of the models 
does not drop significantly. To test these claims, two series of experiments were 
performed: one produced MARS models based on the original, unreduced data; 
the other employed RSAR to reduce the dimensionality of the data and invoked 
MARS to produce models. The Algae dataset was split randomly (using a 50% 
split ratio) into training and test datasets, both transformed as described earlier. 
100 runs were performed for each experiment series. 

For the second experiment, RSAR was run on the suitably preprocessed 
algae dataset. The reduction algorithm selected seven of the eleven conditional 
attributes. This implies that the dataset was reasonably information-rich before 
reduction, but not without redundancies. 

For convenience, each of the seven algae species (one for each decision attrib-
ute) was processed separately in order to provide seven different MARS models
for each experiment run. This simplified assessing the results. For each species,
the overall root-mean-square of the difference between the known labelling of a 
datum and the corresponding MARS prediction was obtained. Although the in- 
put conditional attributes are integer due to their quantisation, MARS predicted 
values are, of course, continuous values. It was preferred not to quantise these in 
determining the error, so as to perceive the situation more accurately. To provide
a clearer picture, errors for both quantised-quantised and continuous-quantised
results are shown. They help provide a minimum and maximum for the expected
errors of the experiments.

Table 1. Experimental results, showing RMS errors.

               Before            After
Alga           Min     Max       Min     Max
Species A      0.923   1.639     0.924   1.642
Species B      0.893   1.362     0.932   1.389
Species C      0.822   1.202     0.856   1.206
Species D      0.497   0.748     0.595   0.768
Species E      0.723   1.210     0.768   1.219
Species F      0.762   1.158     0.892   1.259
Species G      0.669   0.869     0.689   0.872
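
The error measure used for these experiments is the ordinary root-mean-square error between the known labels and the MARS predictions. A minimal sketch follows; the function name `rms_error` is hypothetical, and rounding to the nearest level is only one assumed way of obtaining the quantised-quantised figure.

```python
import math


def rms_error(known, predicted):
    """Root-mean-square of the differences between known and predicted values."""
    if len(known) != len(predicted) or len(known) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((k - p) ** 2 for k, p in zip(known, predicted))
    return math.sqrt(squared / len(known))


if __name__ == "__main__":
    labels = [0, 1, 2, 3]                 # quantised abundance levels
    outputs = [0.2, 1.4, 1.9, 2.6]        # continuous model predictions (made up)
    print(rms_error(labels, outputs))     # continuous-quantised error
    # Quantised-quantised error, assuming predictions are rounded to the nearest level.
    print(rms_error(labels, [round(p) for p in outputs]))
```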

The results are shown in Table 1. Minimum and maximum RMS errors are
shown separately for each alga species. It is clear from these results that the
implications of employing RSAR as a preprocessor for MARS are minimal. The
slight drops in accuracy exhibited after the dimensionality reduction indicate
that the process has removed some of the necessary information. However, a
fuller investigation of the domain reveals that this is due to the quantisation
process employed for this domain, rather than the RSAR methodology itself.
Despite this accuracy reduction, however, the MARS models obtained from the
low-dimensionality data are simpler by at least a factor of $2^4$. This is based
on a conservative assumption that each of the four removed attributes is split
into only two subregions by MARS. Given the relative complexity of even small
MARS models, this reduction in model size is particularly welcome. Processing 
time required by MARS decreases similarly, although the algorithm’s efficiency 
is such that time requirements are not as important as space requirements. 

The advantages of reducing the dimensionality extend to the runtime of the 
system: performing predictions with the simplified MARS models is much faster 
and easier. The drop in dataset dimensionality allows for fewer measured vari- 
ables, which is important for dynamic systems where observables are often re- 
stricted, or where the cost of obtaining more measurements is high. Reducing 
the number of measurements to be made significantly enhances the resultant 
system, with minimal impact on accuracy. 

5 Conclusion 

Most attempts to build function approximators and learning systems of all types 
stumble on the curse of dimensionality. This enforces a ceiling on the applicability 
of many otherwise elegant methodologies, especially when applied to real world 
applications, where the exact parameters of a relation are not necessarily known, 
and many more attributes than necessary are sometimes used to ensure that all 
the information is present.

MARS is a function approximator that generates continuous models based on
spline curves and divide-and-conquer recursive partitioning techniques. Rough 
set theory can be employed to reduce the dimensionality of datasets. RSAR se- 
lects the most information rich attributes in a dataset, while attempting to lose 
no information needed for the classification task. This paper has presented an 
approach that integrates RSAR and MARS. RSAR helps reduce the dimension- 
ality of the domain with which MARS has to cope. The RSAR algorithm has 
proved to be both generalisable and useful in stripping datasets of insignificant 
information, while retaining more important conditional attributes. 

MARS is sensitive to outliers and noise and is particularly prone to high- 
dimensionality problems. Employing RSAR as a preprocessor to MARS provides 
accurate results, emphasising the strong points of MARS, and allowing it to 
be applied to datasets consisting of a moderate to high number of conditional 
attributes. The resultant MARS model becomes smaller and is processed faster 
by real-world application systems. 

Neither of the two techniques is, of course, perfect. Improvements to the
integration framework and further research into the topic remain to be done.
For instance, investigating a means for RSAR to properly deal with unknown
values would be helpful in a smoother co-operation between RSAR and MARS.
Also, it would be interesting to examine the effects of the use of the integrated
system in different application domains. Ongoing work along this direction is
being carried out at Edinburgh.

Abstract. This paper presents the RClass system, which was designed as a
tool for data validation and the classification of uncertain information. This
system uses rough set theory based methods to handle uncertain
information. Some of the proposed classification algorithms also employ fuzzy
set theory in order to increase classification quality. The knowledge base of
the RClass system is expressed as a deterministic or non-deterministic
decision table with quantitative or qualitative values of attributes, and can be
imported from standard databases or text files.

Keywords: Rough sets, fuzzy sets, classifier 



1. Introduction 

In many real situations, man can choose the proper decision on the basis of
uncertain (imprecise, incomplete, inconsistent) information. The imitation of this
ability with automatic methods requires an understanding of how humans collect,
represent, process and utilize information. Although this problem is far from being
solved, several promising formalisms have appeared, like fuzzy set theory proposed
by L.A. Zadeh [11] and rough set theory proposed by Z. Pawlak [7]. Fuzzy set
theory allows the utilization of uncertain knowledge by means of fuzzy linguistic
terms and their membership functions, which reflect humans’ understanding of the
problem. Rough set theory makes it possible to find relationships between data
without any additional information (like prior probability, degree of membership),
only requiring knowledge representation as a set of if-then rules.

Rough set theory based classification methods have been implemented in the
RClass system [3], which will be described in this paper. This system also uses
hybrid fuzzy-rough methods in order to improve the classification of quantitative
data. Both these approaches, rough and fuzzy-rough, are discussed and compared
below.

The paper is divided into six sections. Section 1 contains introductory remarks.
Sections 2 and 3 introduce the theoretical basis of the RClass system: the rough
classification method is presented in section 2, and the so-called fuzzy-rough method
is proposed in section 3. The architecture of the system is briefly described in section
4, section 5 presents a simple numerical example and, finally, the concluding remarks
are detailed in section 6.



W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 152-159, 2001.
© Springer-Verlag Berlin Heidelberg 2001

2. The Rough Classification Method

The knowledge base of the RClass system is represented by a decision table [5],
denoted as $DT = \langle U, A, B, V, f \rangle$, where $U$ is a finite set of objects, $A$ is
a finite set of condition attributes, $B$ is a finite set of decision attributes,
$V = \bigcup_{q \in Q = A \cup B} V_q$, where $V_q$ is a domain of the attribute $q$, and
$f: U \times Q \to V$ is called the information function.
For MISO (Multiple Input Single Output) type systems, like RClass, we can
assume that $B = \{d\}$ and $A = \{a_1, a_2, \ldots, a_N\}$. The decision table is
deterministic (consistent) if $A \to B$. In other cases it is non-deterministic
(inconsistent).

System RClass is an example of a maximum classifier, which decides:

$$\text{observation } x' \in \text{class } y_k \Leftrightarrow B'(x', y_k) = \max_i \{ B'(x', y_i) \}. \quad (1)$$

Function $B'(x', y_i)$, called the decision function, gives the certainty of
classifying observation $x'$ into class $y_i$. We propose to define this function, by
analogy to the fuzzy compositional rule of inference [12], as:

$$B'(x', y_i) = \underset{r \in U}{*_S} \left[ \mu^R(T_A(x'), I_A(r)) \; *_T \; \mu^R(I_A(r), T_B(y_i)) \right], \quad (2)$$

where, respectively, $T_A$, $T_B$, $I_A$ are tolerance sets containing observation $x'$,
decision $y_i$ and rule $r$. Operators $*_T$, $*_S$ denote t-norms (e.g. min) and
t-conorms (e.g. max), respectively, and symbol $\mu^R$ stands for rough inclusion [8].

Tolerance sets $T_A$, $T_B$, $I_A$ can be regarded as sets of all objects with similar
values of the respective attributes:

$$T_P(x') = \{ r_i \in U : Des_P(r_i) \, \mathbf{P} \, x' \}; \quad (3)$$

$$I_P(r) = \{ r_i \in U : Des_P(r) \, \mathbf{P} \, Des_P(r_i) \}, \quad (4)$$

in which symbol $\mathbf{P}$ denotes a tolerance relation defined for a set of attributes $P$.
The description $Des_P(r)$ of object $r \in U$ in terms of values of attributes from $P$
is defined as:

$$Des_P(r) = \{ (q, v) : f(r, q) = v, \ \forall q \in P \}. \quad (5)$$

Employing tolerance approximation spaces [9,10] allows classification of
unseen objects and objects with quantitative values of attributes without
discretization. The definition of the tolerance relation in the RClass system is
based on the concept of a similarity function:

$$\forall r, r' \in U: \quad Des_P(r) \, \mathbf{P} \, Des_P(r') \Leftrightarrow p_P(r, r') \ge \tau, \quad (6)$$

where $p_P$ stands for the similarity function and $\tau \in [0,1]$ denotes the
arbitrarily chosen threshold value of this similarity function.

The similarity function measures the degree of resemblance between two
objects $r, r' \in U$ and can be defined, on the basis of the set of attributes $P$, as
follows:

$$p_P(r, r') = \underset{q_i \in P}{\circledast} \left( w_{q_i} * S_{q_i}(f(r, q_i), f(r', q_i)) \right), \quad (7)$$

where: symbol $\underset{q_i \in P}{\circledast}$ stands for any t-norm (e.g. min, prod) or
average operator,
$w_{q_i}$ represents the weight of attribute $q_i$,
$S_{q_i}$ denotes the measure of similarity between values of attribute $q_i$. As
this measure, according to the type of values of the attribute, one of the
functions presented in Table 1 can be used.



Table 1. Similarity measures for different types of attributes

Type of attribute $q_i$                                 Similarity measure
Quantitative (e.g. age = 18, 23, ...)                   $S_{q_i}(a_s, a_t) = 1 - \frac{|a_s - a_t|}{\max_{q_i} - \min_{q_i}}$
Qualitative (e.g. color = red, green, ...)              $S_{q_i}(a_s, a_t) = 1$ if $a_s = a_t$; $0$ if $a_s \neq a_t$
Ordered qualitative (e.g. pain = weak, medium, ...)     [formula illegible in source]

Symbols $\max_{q_i}$ and $\min_{q_i}$, used in Table 1, denote respectively the maximal
and minimal value of attribute $q_i$.
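
The similarity function and the tolerance test built on it can be sketched as below. This is only an illustration under stated assumptions: min is used as the aggregation operator, the qualitative measure returns 1 for equal values, and all names (`sim_quantitative`, `sim_qualitative`, `similarity`, `tolerant`) are hypothetical rather than taken from RClass.

```python
def sim_quantitative(a, b, lo, hi):
    """1 - |a - b| / (max - min) for a quantitative attribute with range [lo, hi]."""
    if hi == lo:
        return 1.0
    return 1.0 - abs(a - b) / (hi - lo)


def sim_qualitative(a, b):
    """1 if the two values are equal, 0 otherwise (assumed orientation)."""
    return 1.0 if a == b else 0.0


def similarity(r1, r2, measures, weights, aggregate=min):
    """Aggregate the weighted per-attribute similarities over the attributes in P."""
    return aggregate(weights[q] * measures[q](r1[q], r2[q]) for q in measures)


def tolerant(r1, r2, measures, weights, threshold, aggregate=min):
    """Two objects are in the tolerance relation if their similarity reaches the threshold."""
    return similarity(r1, r2, measures, weights, aggregate) >= threshold


if __name__ == "__main__":
    measures = {
        "age": lambda a, b: sim_quantitative(a, b, 18, 68),
        "color": sim_qualitative,
    }
    weights = {"age": 1.0, "color": 1.0}
    first = {"age": 18, "color": "red"}
    second = {"age": 23, "color": "red"}
    print(similarity(first, second, measures, weights))
    print(tolerant(first, second, measures, weights, threshold=0.8))
```

Raising the threshold makes the tolerance classes smaller; at a threshold of 1 with purely qualitative attributes the relation reduces to plain equality of descriptions.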

In systems containing only qualitative attributes, the tolerance relation defined
above becomes an equivalence relation and therefore equation (2) can be
simplified to the following formula:

$$B'(x', y_i) = \mu^R(T_A(x'), T_B(y_i)) \quad (8)$$

This formula stands for the problem of choosing the best decision rule:

$$R_i : T_A(x') \to T_B(y_i) \quad (9)$$

Such a decision rule is deterministic (consistent or certain) in DT if
$T_A(x') \subseteq T_B(y_i)$, and $R_i$ is non-deterministic (inconsistent or possible)
in DT in other cases. In the RClass system, decision rules are evaluated using the
concept of rough inclusion. Here are examples of rough inclusions which are
implemented in this system:

$$\mu_1^R(X, Y) = \begin{cases} 1, & \text{if } X \subseteq Y \\ 0, & \text{otherwise} \end{cases} \quad (10)$$

$$\mu_2^R(X, Y) = \begin{cases} \dfrac{card(X \cap Y)}{card(X)}, & \text{if } X \neq \emptyset \\ 1, & \text{if } X = \emptyset \end{cases} \quad (11)$$

$$\mu_3^R(X, Y) = \frac{card((U \setminus X) \cup Y)}{card(U)} \quad (12)$$

$$\mu_4^R(X, Y) = \begin{cases} 1, & X \subseteq Y \\ \dfrac{card(Y)}{card(U)}, & X \not\subseteq Y \end{cases} \quad (13)$$



Inclusion 1 is a typical inclusion from set theory and only allows deterministic
rules to be separated from non-deterministic ones. Inclusion 2 is well known [8] as
a standard rough inclusion. For systems with specified strength factors and
specificity factors [2], the definition of the cardinality operator "card" can be
extended as follows:

$$card'(X) = \sum_{r_i \in U} Strength\_factor(r_i) * Specificity\_factor(r_i) * \chi_X(r_i), \quad (14)$$

where $\chi_X$ denotes the characteristic function of set $X$.
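
For the purely qualitative case, the first three rough inclusions and the maximum classifier of equations (1) and (8) can be sketched with ordinary Python sets. The representation of a rule as a `(condition, decision)` pair and the names `inclusion_1`, `inclusion_2`, `inclusion_3` and `classify` are assumptions made for this example only.

```python
def inclusion_1(X, Y):
    """Classical set inclusion: 1 if X is a subset of Y, else 0."""
    return 1.0 if X <= Y else 0.0


def inclusion_2(X, Y):
    """Standard rough inclusion: card(X & Y) / card(X), or 1 for empty X."""
    return 1.0 if not X else len(X & Y) / len(X)


def inclusion_3(X, Y, U):
    """card((U minus X) union Y) / card(U)."""
    return len((U - X) | Y) / len(U)


def classify(x, rules, inclusion=inclusion_2):
    """Pick the decision y maximising inclusion(T_A(x), T_B(y)).

    rules is a list of (condition_dict, decision) pairs; with qualitative
    attributes, T_A(x) is the set of rules whose condition part equals x.
    """
    matching = {i for i, (cond, _) in enumerate(rules) if cond == x}
    decisions = sorted({dec for _, dec in rules})
    scores = {
        y: inclusion(matching, {i for i, (_, dec) in enumerate(rules) if dec == y})
        for y in decisions
    }
    return max(decisions, key=scores.get), scores


if __name__ == "__main__":
    table = [
        ({"c": "red"}, "A"),
        ({"c": "red"}, "A"),
        ({"c": "red"}, "B"),
        ({"c": "blue"}, "B"),
    ]
    print(classify({"c": "red"}, table))
```

With inclusion 1 the inconsistent observation `{"c": "red"}` would score 0 for every class, whereas inclusion 2 still ranks the classes by how many matching rules support them.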



3. The Fuzzy-Rough Classification Method 

Applying rough set theory based analysis to knowledge which contains
measurable, quantitative data requires using specialized discretization methods or
employing a tolerance relation instead of an equivalence relation (as was proposed
above). The disadvantage of both these solutions is that they group similar objects
into classes without remembering differences between objects of the same class. We
think that this lost, additional information would improve the classification process
and therefore for such measurable data we suggest using the fuzzy-rough
classification method that was implemented in the RClass system. This method, which
is a modification of the rough method presented above, allows taking into account
not only the fact of similarity between two objects, but also the degree of this
similarity.

In the fuzzy-rough classification method, we defined classes of the tolerance
relation as fuzzy sets: the membership functions of these sets determine the degree
of similarity of objects which belong to the same class. This means that tolerance
sets $T_P(x)$, $I_P(r)$ from formula (2) were replaced by fuzzy sets $RT_P(x)$, $RI_P(r)$
with the following membership functions:

$$\mu_{RT_P(x)}(r) = \underset{q_i \in P}{\circledast} \left( w_{q_i} * S_{q_i}(f(x, q_i), f(r, q_i)) \right), \quad (15)$$

$$\mu_{RI_P(r')}(r) = \underset{q_i \in P}{\circledast} \left( w_{q_i} * S_{q_i}(f(r', q_i), f(r, q_i)) \right). \quad (16)$$

Due to the use of fuzzy sets, we had to apply fuzzy set theory’s operators instead
of classical set theory’s operators, e.g. the following cardinality operator:

$$\forall X \subseteq U: \quad card^{FR}(X) = \sum_{r_i \in U} \mu_X(r_i), \quad (17)$$

Employing fuzzy set operators in the definition of a rough inclusion, we introduced
the concept of a fuzzy-rough inclusion as an extension of a rough inclusion. After
all these changes, the final formula for the decision function of the fuzzy-rough
classifier can be written as:

$$B'(x', y) = \underset{r \in U}{*_S} \left[ \mu^{FR}(RT_A(x'), RI_A(r)) \; *_T \; \mu^{FR}(RI_A(r), T_B(y)) \right], \quad (18)$$

where $\mu^{FR}$ denotes any fuzzy-rough inclusion.

It can be easily noticed that, when we have knowledge which contains only
qualitative information, the fuzzy-rough method presented above is equal to the
rough method (because there are no differences between objects of the same class
and therefore all fuzzy sets are crisp sets).
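
The fuzzy cardinality of equation (17) and one possible fuzzy-rough inclusion can be sketched as follows. The text does not spell out which fuzzy-rough inclusions RClass implements, so the version below (the fuzzy counterpart of the standard rough inclusion, with min as the fuzzy intersection) is an assumption, as are the names `fuzzy_card` and `fuzzy_rough_inclusion`.

```python
def fuzzy_card(mu):
    """Fuzzy cardinality: the sum of membership degrees over all objects."""
    return sum(mu.values())


def fuzzy_rough_inclusion(mu_x, mu_y):
    """card(X intersect Y) / card(X) with min as fuzzy intersection (assumed form)."""
    card_x = fuzzy_card(mu_x)
    if card_x == 0:
        return 1.0  # mirrors the empty-set case of the crisp inclusion
    shared = sum(min(degree, mu_y.get(obj, 0.0)) for obj, degree in mu_x.items())
    return shared / card_x


if __name__ == "__main__":
    # Membership functions as dictionaries: object name -> degree in [0, 1].
    mu_x = {"r1": 1.0, "r2": 0.5}
    mu_y = {"r1": 0.5, "r2": 0.5, "r3": 1.0}
    print(fuzzy_card(mu_x))
    print(fuzzy_rough_inclusion(mu_x, mu_y))
```

When every membership degree is 0 or 1, the function returns the same value as the crisp inclusion 2, which matches the remark that the two methods coincide on purely qualitative knowledge.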



4. Architecture of the RClass System 

The most important parts of the RClass system are the classification mechanism,
knowledge base and user interface. The classification mechanism, based on rough
and fuzzy-rough methods, is the kernel of the RClass system. The knowledge base
contains a decision table represented by a set of MISO rules. The user interface is
the standard interface for Windows applications (see figure 1). These modules,
which enable basic functions of the RClass system, are completed by a knowledge
base manager, knowledge analysing tools and an internal editor. The knowledge
base manager is responsible for reading and writing knowledge bases from/to
external files. It also allows importation of files from many well-known
applications. Knowledge analysing tools consist of several basic functions of data
analysis based on rough set theory, such as weights calculation, reduct finding,
and calculation of the quality of classification. The internal editor enables simple
knowledge base modification (for creating knowledge bases and deep modification,
an external editor is suggested). The architecture of the RClass system is described
in Fig. 2.

Fig. 1. The user interface of the RClass system

Fig. 2. The architecture of the RClass system

System RClass is a 32-bit object-oriented application working in a
Windows95/NT environment. The object-oriented structure [4] makes later
development straightforward and allows use of a variety of object-oriented software
libraries. The 32-bit architecture seems to be more stable and faster than the 16-bit
one. The Delphi 3.0 32-bit object-oriented software environment, with its Visual
Component Library (VCL), was chosen to implement the RClass system, mainly
because it offers a diversity of database management functions.

5. Numerical Example

This example presents results obtained in the RClass system on the well-known Iris
data set. This data set, made by E. Anderson [1], contains 150 samples belonging to
one of three classes (Iris Setosa, Iris Versicolor, and Iris Virginica) with 50 samples
each. The instances describe an iris plant using four input features: sepal length
($x_1$), sepal width ($x_2$), petal length ($x_3$), and petal width ($x_4$). This data set is
very simple: the first class is linearly separable from the other two classes, and our
rough set analysis proved the strong relationships between the decision and
attributes $x_3$, $x_4$ and a big excess of used information. We successfully applied
our system to solve much more complicated classification tasks, but we decided to
present the Iris problem, because it is often used as a benchmark. The Iris data set
is also easily available on the Internet and can be obtained e.g. from the address
http://www.ics.uci.edu/~mlearn/MLRepository.html. To make the classification
task less trivial, we used the original data set reduced to the two most significant
input variables $x_3$, $x_4$. For our experiments, we prepared a training set employing
the first 25 instances of each class; the remaining 75 samples of the original data
set were used as a testing set.



Table 2. The knowledge base of fuzzy-rough classifier

x3      x4      dec     Strength of rule
1.4     0.2     ...     25
4.3     1.3     ...     25
5.2     1.7     ...     12
...     2.2     ...     13


Table 3. The knowledge base of rough classifier

x3      x4      dec     Strength of rule
1.5     0.2     1       25
4.3     1.3     2       25
5.7     2.1     3       10
4.9     1.8     3       5
4.9     1.7     3       5
5       1.5     3       5


From the training data set we extracted 4 rules (see Table 2) as a database of the
fuzzy-rough classifier and 6 rules (see Table 3) as a database of the rough classifier.
In spite of such significant knowledge reduction, the results achieved for both
presented solutions were satisfactory: 2.7% errors for the fuzzy-rough classification
method and 5.3% errors for the rough classification method. The obtained results are
similar to results achieved in other systems on the same reduced data sets, e.g. the
neuro-fuzzy system NEFCLASS, presented in paper [6], came out with 4%
incorrectly classified objects. The better result of the fuzzy-rough method shows that
even employing a tolerance relation instead of an equivalence relation can cause the
loss of some necessary information and therefore lower classification quality. It
suggests that although rough set theory based analysis can be used for both
qualitative and quantitative information, it is more specialized for processing
qualitative data.



6. Conclusions 

The rough classification system RClass presented above seems to be a universal 
tool for handling imprecise knowledge. The system is realised as a shell system, 
which potentially can be applied to various areas of human activity. There are many 
knowledge validation methods (weights calculation, data reduction and rule 
extraction) and two classification methods (rough and fuzzy-rough) implemented in 
the RClass system, therefore it can be used both as a classifier and as a pre-
processor of knowledge.

Abstract: In this paper we open a new avenue for applications of the rough set
concept to decision support. We consider the classical problem of decision
under risk, proposing a rough set model based on stochastic dominance. We
start with the case of traditional additive probability distribution over the set of
states of the world; however, the model is rich enough to handle non-additive
probability distributions and even qualitative ordinal distributions. The rough
set approach gives a representation of the decision maker’s preferences in terms of
"if..., then..." decision rules induced from rough approximations of sets of
exemplary decisions.



1. Introduction 



Decisions under risk have been intensively investigated by many researchers (for a 
comprehensive review see [2]). In this paper, we present an approach to this problem 
based on the rough sets theory ([4]). Since decisions under risk involve data 
expressed on preference-ordered domains (larger outcomes are preferable to smaller 
outcomes), we consider the Dominance-based Rough Set Approach (DRSA) [3]. The 
paper has the following plan. Section 2 recalls basic principles of DRSA. Section 3 
introduces the rough sets approach to decision under risk. Section 4 presents a 
didactic example and section 5 contains conclusions. 



2. Dominance-Based Rough Set Approach (DRSA) 



For algorithmic reasons, knowledge about objects is represented in the form of an 
information table. The rows of the table are labelled by objects, whereas columns are 
labelled by attributes and entries of the table are attribute-values, called descriptors. 

Formally, by an information table we understand the 4-tuple $S=\langle U,Q,V,f\rangle$, where 
$U$ is a finite set of objects, $Q$ is a finite set of attributes, $V=\bigcup_{q\in Q}V_q$ and $V_q$ is a 
domain of the attribute $q$, and $f: U\times Q\to V$ is a total function such that $f(x,q)\in V_q$ for 
every $q\in Q$, $x\in U$, called an information function [4]. The set $Q$ is, in general, divided 
into set $C$ of condition attributes and set $D$ of decision attributes. In general, the 
notion of condition attribute differs from that of criterion because the scale (domain) 
of a criterion has to be ordered according to a decreasing or increasing preference, 
while the domain of the condition attribute does not have to be ordered. 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 160-169, 2001. 

Assuming that all condition attributes $q\in C$ are criteria, let $S_q$ be an outranking 
relation [5] on $U$ with respect to criterion $q$ such that $xS_qy$ means “$x$ is at least as 
good as $y$ with respect to criterion $q$”. We suppose that $S_q$ is a total preorder, i.e. a 
strongly complete and transitive binary relation, defined on $U$ on the basis of 
evaluations $f(\cdot,q)$. 

Furthermore, assuming that the set of decision attributes $D$ (possibly a singleton 
$\{d\}$) makes a partition of $U$ into a finite number of classes, let $Cl=\{Cl_t, t\in T\}$, 
$T=\{1,...,n\}$, be a set of these classes such that each $x\in U$ belongs to one and only one 
$Cl_t\in Cl$. We suppose that the classes are ordered, i.e. for all $r,s\in T$, such that $r>s$, the 
objects from $Cl_r$ are preferred (strictly or weakly [5]) to the objects from $Cl_s$. More 
formally, if $S$ is a comprehensive outranking relation on $U$, i.e. if for all $x,y\in U$, $xSy$ 
means “$x$ is at least as good as $y$”, we suppose: $[x\in Cl_r, y\in Cl_s, r>s] \Rightarrow [xSy$ and not 
$ySx]$. The above assumptions are typical for consideration of a multiple-criteria 
sorting problem. 

The sets to be approximated are called upward union and downward union of 
classes, respectively: 

$Cl_t^{\geq}=\bigcup_{s\geq t}Cl_s$, $Cl_t^{\leq}=\bigcup_{s\leq t}Cl_s$, $t=1,...,n$. 

The statement $x\in Cl_t^{\geq}$ means “$x$ belongs at least to class $Cl_t$”, while $x\in Cl_t^{\leq}$ 
means “$x$ belongs at most to class $Cl_t$”. 

Let us remark that $Cl_1^{\geq}=Cl_n^{\leq}=U$, $Cl_n^{\geq}=Cl_n$ and $Cl_1^{\leq}=Cl_1$. Furthermore, for 
$t=2,...,n$, we have: 

$Cl_{t-1}^{\leq}=U-Cl_t^{\geq}$ and $Cl_t^{\geq}=U-Cl_{t-1}^{\leq}$. 

The key idea of rough sets is approximation of one knowledge by another 
knowledge. In Classical Rough Set Approach (CRSA) [4] the knowledge 
approximated is a partition of U into classes generated by a set of decision attributes; 
the knowledge used for approximation is a partition of U into elementary sets of 
objects that are indiscernible by a set of condition attributes. The elementary sets are 
seen as “granules of knowledge” used for approximation. 

In DRSA, where condition attributes are criteria and classes are preference-ordered, 
the knowledge approximated is a collection of upward and downward unions of 
classes and the “granules of knowledge” are sets of objects defined using dominance 
relation instead of indiscernibility relation. This is the main difference between 
CRSA and DRSA. 

We say that $x$ dominates $y$ with respect to $P\subseteq C$, denoted by $xD_Py$, if $xS_qy$ for all 
$q\in P$. Given $P\subseteq C$ and $x\in U$, the “granules of knowledge” used for approximation in 
DRSA are: 

• a set of objects dominating $x$, called P-dominating set, $D_P^+(x)=\{y\in U: yD_Px\}$, 

• a set of objects dominated by $x$, called P-dominated set, $D_P^-(x)=\{y\in U: xD_Py\}$. 

For any $P\subseteq C$ we say that $x\in U$ belongs to $Cl_t^{\geq}$ without any ambiguity if $x\in Cl_t^{\geq}$ 
and for all the objects $y\in U$ dominating $x$ with respect to $P$, we have $y\in Cl_t^{\geq}$, i.e. 
$D_P^+(x)\subseteq Cl_t^{\geq}$. Furthermore, we say that $y\in U$ could belong to $Cl_t^{\geq}$ if there would 
exist at least one object $x\in Cl_t^{\geq}$ such that $y$ dominates $x$ with respect to $P$, i.e. 
$y\in D_P^+(x)$. 

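Thus, with respect to $P\subseteq C$, the set of all objects belonging to $Cl_t^{\geq}$ without any 
ambiguity constitutes the P-lower approximation of $Cl_t^{\geq}$, denoted by $\underline{P}(Cl_t^{\geq})$, and 
the set of all objects that could belong to $Cl_t^{\geq}$ constitutes the P-upper approximation 
of $Cl_t^{\geq}$, denoted by $\overline{P}(Cl_t^{\geq})$: 

$\underline{P}(Cl_t^{\geq})=\{x\in U: D_P^+(x)\subseteq Cl_t^{\geq}\}$, $\overline{P}(Cl_t^{\geq})=\bigcup_{x\in Cl_t^{\geq}}D_P^+(x)$, for $t=1,...,n$. 

Analogously, using $D_P^-(x)$ one can define P-lower approximation and P-upper 
approximation of $Cl_t^{\leq}$: 

$\underline{P}(Cl_t^{\leq})=\{x\in U: D_P^-(x)\subseteq Cl_t^{\leq}\}$, $\overline{P}(Cl_t^{\leq})=\bigcup_{x\in Cl_t^{\leq}}D_P^-(x)$, for $t=1,...,n$. 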

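The P-boundaries (P-doubtful regions) of $Cl_t^{\geq}$ and $Cl_t^{\leq}$ are defined as: 

$Bn_P(Cl_t^{\geq})=\overline{P}(Cl_t^{\geq})-\underline{P}(Cl_t^{\geq})$, $Bn_P(Cl_t^{\leq})=\overline{P}(Cl_t^{\leq})-\underline{P}(Cl_t^{\leq})$, for $t=1,...,n$. 

Due to complementarity of the rough approximations [3], the following property 
holds: 

$Bn_P(Cl_t^{\geq})=Bn_P(Cl_{t-1}^{\leq})$, for $t=2,...,n$, and $Bn_P(Cl_t^{\leq})=Bn_P(Cl_{t+1}^{\geq})$, for 
$t=1,...,n-1$. 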



For every $t\in T$ and for every $P\subseteq C$ we define the quality of approximation of 
partition $Cl$ by set of attributes $P$, or in short, quality of sorting: 

$\gamma_P(Cl)=\frac{card(U-\bigcup_{t\in T}Bn_P(Cl_t^{\leq}))}{card(U)}=\frac{card(U-\bigcup_{t\in T}Bn_P(Cl_t^{\geq}))}{card(U)}$. 

The quality expresses the ratio of all P-correctly sorted objects to all objects in the 
table. 

Each minimal subset $P\subseteq C$ such that $\gamma_P(Cl)=\gamma_C(Cl)$ is called a reduct of $Cl$ and 
denoted by $RED_{Cl}$. Let us remark that an information table can have more than one 
reduct. The intersection of all reducts is called the core and denoted by $CORE_{Cl}$. 

The dominance-based rough approximations of upward and downward unions of 
classes can serve to induce a generalized description of objects contained in the 
information table in terms of “if ..., then ...” decision rules. For a given upward or 
downward union of classes, $Cl_t^{\geq}$ or $Cl_s^{\leq}$, the decision rules induced under a 
hypothesis that objects belonging to $\underline{P}(Cl_t^{\geq})$ or $\underline{P}(Cl_s^{\leq})$ are positive and all the others 
negative, suggest an assignment to “at least class $Cl_t$” or to “at most class $Cl_s$”, 
respectively; on the other hand, the decision rules induced under a hypothesis that 
objects belonging to the intersection $\overline{P}(Cl_s^{\leq})\cap\overline{P}(Cl_t^{\geq})$ are positive and all the others 
negative, suggest an assignment to some classes between $Cl_s$ and $Cl_t$ ($s<t$). 

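Assuming that for each $q\in C$, $V_q\subseteq R$ (i.e. $V_q$ is quantitative) and that for each $x,y\in U$, 
$f(x,q)\geq f(y,q)$ implies $xS_qy$ (i.e. $V_q$ is preference-ordered), the following three types of 
decision rules can be considered: 

1) $D_{\geq}$-decision rules with the following syntax: 

if $f(x,q_1)\geq r_{q_1}$ and $f(x,q_2)\geq r_{q_2}$ and ... $f(x,q_p)\geq r_{q_p}$, then $x\in Cl_t^{\geq}$, 

where $P=\{q_1,...,q_p\}\subseteq C$, $(r_{q_1},...,r_{q_p})\in V_{q_1}\times V_{q_2}\times...\times V_{q_p}$ and $t\in T$; 

2) $D_{\leq}$-decision rules with the following syntax: 

if $f(x,q_1)\leq r_{q_1}$ and $f(x,q_2)\leq r_{q_2}$ and ... $f(x,q_p)\leq r_{q_p}$, then $x\in Cl_t^{\leq}$, 

where $P=\{q_1,...,q_p\}\subseteq C$, $(r_{q_1},...,r_{q_p})\in V_{q_1}\times V_{q_2}\times...\times V_{q_p}$ and $t\in T$; 

3) $D_{\geq\leq}$-decision rules with the following syntax: 

if $f(x,q_1)\geq r_{q_1}$ and $f(x,q_2)\geq r_{q_2}$ and ... $f(x,q_k)\geq r_{q_k}$ and $f(x,q_{k+1})\leq r_{q_{k+1}}$ and ... 
$f(x,q_p)\leq r_{q_p}$, then $x\in Cl_s\cup Cl_{s+1}\cup...\cup Cl_t$, 

where $O'=\{q_1,...,q_k\}\subseteq C$, $O''=\{q_{k+1},...,q_p\}\subseteq C$, $P=O'\cup O''$, $O'$ and $O''$ not 
necessarily disjoint, $(r_{q_1},...,r_{q_p})\in V_{q_1}\times V_{q_2}\times...\times V_{q_p}$, $s,t\in T$ such that $s<t$. 

As it is possible that $\{q_1,...,q_k\}\cap\{q_{k+1},...,q_p\}\neq\emptyset$, in the condition part of a 
$D_{\geq\leq}$-decision rule we can have “$f(x,q)\geq r_q$” and “$f(x,q)\leq r'_q$”, where $r_q\leq r'_q$, for some $q\in C$. 
Moreover, if $r_q=r'_q$, the two conditions boil down to “$f(x,q)=r_q$”. 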

Since each decision rule is an implication, by a minimal decision rule we 
understand such an implication that there is no other implication with an antecedent 
of at least the same weakness and a consequent of at least the same strength. 

3. DRSA for Decision under Risk 

To apply rough sets theory to decision under risk, we consider the following basic 
elements: 

- a set $S=\{s_1, s_2, ..., s_n\}$ of states of the world, or simply states, which are supposed 
to be mutually exclusive and collectively exhaustive, 

- an a priori probability distribution $P$ over the states of the world: more precisely, 
the probabilities of states $s_1, s_2, ..., s_n$ are $p_1, p_2, ..., p_n$, respectively 
($p_1+p_2+...+p_n=1$, $p_i\geq 0$, $i=1,...,n$), 

- a set $A=\{A_1, A_2, ..., A_m\}$ of acts, 

- a set $X=\{x_1, x_2, ..., x_r\}$ of consequences or outcomes that for the sake of simplicity 
we suppose to be expressed in monetary terms and therefore $X\subseteq R$, 

- a function $g: A\times S\to X$ assigning to each act-state pair $(A_i,s_j)\in A\times S$ a consequence 
$x_h\in X$, 

- a set of classes $Cl=\{Cl_1, Cl_2, ..., Cl_t\}$, such that $Cl_1\cup Cl_2\cup...\cup Cl_t=A$, $Cl_p\cap Cl_q=\emptyset$ 
for each $p,q\in\{1,2,...,t\}$ with $p\neq q$; the classes of $Cl$ are preference-ordered 
according to an increasing order of their indices, in the sense that for each 
$A_i,A_j\in A$, if $A_i\in Cl_p$ and $A_j\in Cl_q$ with $p>q$, then $A_i$ is preferred to $A_j$, 

- a function $e: A\to Cl$ assigning each act $A_i\in A$ to a class $Cl_j\in Cl$. 

In this context, two different types of dominance can be considered: 

1) (classical) dominance: given $A_i,A_j\in A$, $A_i$ dominates $A_j$ iff for each possible state of 
nature act $A_i$ gives an outcome at least as good as act $A_j$. More formally, 
$g(A_i,s_k)\geq g(A_j,s_k)$, for each $s_k\in S$, 

2) stochastic dominance: given $A_i,A_j\in A$, $A_i$ stochastically dominates $A_j$ iff for each 
outcome $x\in X$, act $A_i$ gives an outcome at least as good as $x$ with a probability at least 
as large as act $A_j$. 

Case 1) corresponds to the case in which the utility is state dependent (see e.g. [6]) 
while case 2) corresponds to a model of decision under risk proposed by Allais [1]. In 
this paper we consider this second case. 

On the basis of an a priori probability distribution $P$, we can assign to each subset 
of states of the world $W\subseteq S$ ($W\neq\emptyset$) the probability $P(W)$ that one of the states in $W$ is 
verified, i.e. $P(W)=\sum_{i: s_i\in W}p_i$, and then build up the set $\Pi$ of all the possible values 
$P(W)$, i.e. $\Pi=\{\pi\in[0,1]: \pi=P(W), W\subseteq S\}$. 

We define the following function $z: A\times S\to\Pi$ assigning to each act-state pair 
$(A_i,s_j)\in A\times S$ a probability $\pi\in\Pi$ as follows: 

$z(A_i,s_j)=\sum_{r: g(A_i,s_r)\geq g(A_i,s_j)}p_r$. 

Therefore, $z(A_i,s_j)$ represents the probability of obtaining an outcome whose value 
is at least $g(A_i,s_j)$ by act $A_i$. 

On the basis of function $z(A_i,s_j)$ we can define the function $\rho: A\times\Pi\to X$ as 
follows: 

$\rho(A_i,\pi)=\max_{j: z(A_i,s_j)\geq\pi} g(A_i,s_j)$. 

Thus $\rho(A_i,\pi)=x$ means that by act $A_i$ we can gain at least $x$ with a probability of 
at least $\pi$. 

Using the function $z(A_i,s_j)$, we can also define the function $\rho': A\times\Pi\to X$ as 

$\rho'(A_i,\pi)=\max_{j: z(A_i,s_j)>1-\pi} g(A_i,s_j)$. 

$\rho'(A_i,\pi)=x$ means that by act $A_i$ we can gain at most $x$ with a probability of at least $\pi$. 

If the elements of $\Pi$, $0=\pi_{(1)}, \pi_{(2)}, ..., \pi_{(d)}=1$ ($d=card(\Pi)$), are reordered in such a way 
that $\pi_{(1)}<\pi_{(2)}<...<\pi_{(d)}$, then we have $\rho(A_i,\pi_{(j)})=\rho'(A_i,1-\pi_{(j-1)})$. 

Therefore, $\rho(A_i,\pi_{(j)})\leq x$ is equivalent to $\rho'(A_i,1-\pi_{(j-1)})\leq x$, $A_i\in A$, $\pi_{(j)}\in\Pi$, $x\in X$. 

Given $A_i,A_j\in A$, $A_i$ stochastically dominates $A_j$ if and only if $\rho(A_i,\pi)\geq\rho(A_j,\pi)$ for 
each $\pi\in\Pi$. This is equivalent to saying: given $A_i,A_j\in A$, $A_i$ stochastically dominates $A_j$ if 
and only if $\rho'(A_i,\pi)\geq\rho'(A_j,\pi)$ for each $\pi\in\Pi$. 

We can apply DRSA in this context considering as set of objects $U$ the set of acts 
$A$, as set of attributes (criteria) $Q$ the set $\Pi\cup\{cl\}$, where $cl$ is an attribute representing 
the classification of acts from $A$ into classes from $Cl$, as set $V$ the set $X\cup Cl$, as 
information function $f$ a function such that $f(A_i,\pi)=\rho(A_i,\pi)$ and $f(A_i,cl)=e(A_i)$. With 
respect to the set of attributes $Q$, the set $C$ of condition attributes corresponds to the 
set $\Pi$ and the set of decision attributes $D$ corresponds to $\{cl\}$. 

The aim of this rough set approach to decision under risk is to explain the 
preferences of the decision maker represented by the assignments of the acts from $A$ 
to the classes of $Cl$ in terms of stochastic dominance expressed by means of function 
$\rho$. 



4. A Didactic Example 



The following example illustrates the approach. Let us consider 

- a set $S=\{s_1, s_2, s_3\}$ of states of the world, 

- an a priori probability distribution $P$ over the states of the world defined as follows: 
$p_1=.25$, $p_2=.35$, $p_3=.40$, 

- a set $A=\{A_1, A_2, A_3, A_4, A_5, A_6\}$ of acts, 

- a set $X=\{0, 10, 15, 20, 30\}$ of consequences, 

- a set of classes $Cl=\{Cl_1, Cl_2, Cl_3\}$, where $Cl_1$ is the set of bad acts, $Cl_2$ is the set of 
medium acts, $Cl_3$ is the set of good acts, 

- a function $g: A\times S\to X$ assigning to each act-state pair $(A_i,s_j)\in A\times S$ a consequence 
$x_h\in X$ and a function $e: A\to Cl$ assigning each act $A_i\in A$ to a class $Cl_j\in Cl$, 
presented in the following Table 1. 



Table 1 Acts, consequences and assignment to classes from Cl 

        p_j    A1      A2      A3      A4      A5      A6 
s1      .25    30      0       15      0       20      10 
s2      .35    10      20      0       15      10      20 
s3      .40    10      20      20      20      20      20 
cl             good    medium  medium  bad     medium  good 

Table 2 shows the values of function $\rho(A_i,\pi)$. 



Table 2 Acts, values of function $\rho(A_i,\pi)$ and assignment to classes from Cl 

pi      A1      A2      A3      A4      A5      A6 
.25     30      20      20      20      20      20 
.35     10      20      20      20      20      20 
.40     10      20      20      20      20      20 
.60     10      20      15      15      20      20 
.65     10      20      15      15      20      20 
.75     10      20      0       15      10      20 
1       10      0       0       0       10      10 
cl      good    medium  medium  bad     medium  good 
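The entries of Table 2 can be recomputed from Table 1. The following is a minimal sketch, not the authors' implementation: it reads $\rho(A_i,\pi)$ as the largest outcome $g(A_i,s_j)$ whose probability $z(A_i,s_j)$ is at least $\pi$, which is the reading that reproduces Table 2. The names `P`, `G`, `z`, `rho`, `PI` and `TABLE2` are illustrative assumptions; exact fractions are used to avoid rounding when probabilities are added.

```python
from fractions import Fraction

# Table 1: state probabilities p1, p2, p3 and the outcome g(A_i, s_j) of each act.
P = [Fraction(25, 100), Fraction(35, 100), Fraction(40, 100)]
G = {
    "A1": [30, 10, 10],
    "A2": [0, 20, 20],
    "A3": [15, 0, 20],
    "A4": [0, 15, 20],
    "A5": [20, 10, 20],
    "A6": [10, 20, 20],
}


def z(act, j):
    """Probability that `act` yields an outcome at least as good as in state j."""
    outcomes = G[act]
    return sum(p for p, out in zip(P, outcomes) if out >= outcomes[j])


def rho(act, pi):
    """Largest outcome that `act` reaches or exceeds with probability at least pi."""
    outcomes = G[act]
    return max(outcomes[j] for j in range(len(outcomes)) if z(act, j) >= pi)


# Pi: probabilities of all non-empty subsets of the three states, in increasing order.
PI = sorted({sum(P[i] for i in range(3) if (mask >> i) & 1) for mask in range(1, 8)})

# One column of Table 2 per act, one entry per probability in PI.
TABLE2 = {act: [rho(act, pi) for pi in PI] for act in G}

if __name__ == "__main__":
    for act, column in TABLE2.items():
        print(act, column)
```

Each list in `TABLE2` corresponds to one column of Table 2, read from the row .25 down to the row 1.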



Table 2 is the data table on which the DRSA is applied. Let us give some examples 
of the interpretation of the values in Table 2. If we consider the column of act $A_3$, we 
have that by act $A_3$ 

• the value 20 in the row corresponding to .25 means that the outcome is at least 
20 with a probability of at least .25, 

• the value 15 in the row corresponding to .65 means that the outcome is at least 
15 with a probability of at least .65, 

• the value 0 in the row corresponding to .75 means that the outcome is at least 0 
with a probability of at least .75. 

If we consider the row corresponding to .65, then 

• the value 10 relative to $A_1$ means that by act $A_1$ the outcome is at least 10 with a 
probability of at least .65, 

• the value 20 relative to $A_2$ means that by act $A_2$ the outcome is at least 20 with a 
probability of at least .65, 



and so on. 

Applying the rough set approach we approximate the following upward and 
downward unions of classes: 

$Cl_2^{\geq}=Cl_2\cup Cl_3$, i.e. the set of the acts at least medium, 

$Cl_3^{\geq}=Cl_3$, i.e. the set of the acts (at least) good, 

$Cl_1^{\leq}=Cl_1$, i.e. the set of the acts (at most) bad, 

$Cl_2^{\leq}=Cl_1\cup Cl_2$, i.e. the set of the acts at most medium. 

The first result of the DRSA approach was a discovery that the data table (Table 2) 
is not consistent. Indeed, Table 2 shows that act $A_4$ stochastically dominates act $A_3$, 
however act $A_3$ is assigned to a better class (medium) than act $A_4$ (bad). Therefore, 
act $A_3$ cannot be assigned without doubt to the set of at least medium acts, and 
act $A_4$ cannot be assigned without doubt to the set of (at most) bad acts. In 
consequence, the lower and upper approximations of $Cl_2^{\geq}$, $Cl_3^{\geq}$ and $Cl_1^{\leq}$, $Cl_2^{\leq}$ 
are equal, respectively, to 

$\underline{C}(Cl_2^{\geq})=\{A_1,A_2,A_5,A_6\}=Cl_2^{\geq}-\{A_3\}$, $\overline{C}(Cl_2^{\geq})=\{A_1,A_2,A_3,A_4,A_5,A_6\}=Cl_2^{\geq}\cup\{A_4\}$, 

$\underline{C}(Cl_3^{\geq})=\{A_1,A_6\}=Cl_3^{\geq}$, $\overline{C}(Cl_3^{\geq})=\{A_1,A_6\}=Cl_3^{\geq}$, 

$\underline{C}(Cl_1^{\leq})=\emptyset=Cl_1^{\leq}-\{A_4\}$, $\overline{C}(Cl_1^{\leq})=\{A_3,A_4\}=Cl_1^{\leq}\cup\{A_3\}$, 

$\underline{C}(Cl_2^{\leq})=\{A_2,A_3,A_4,A_5\}=Cl_2^{\leq}$, $\overline{C}(Cl_2^{\leq})=\{A_2,A_3,A_4,A_5\}=Cl_2^{\leq}$. 

Since two of the six acts ($A_3$, $A_4$) are inconsistent, the quality 
of approximation (quality of sorting) is equal to 4/6. 
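These approximations can be checked mechanically from Table 2. The sketch below is an illustration under stated assumptions rather than the authors' software: classes are encoded as 1 (bad), 2 (medium) and 3 (good), dominance is tested componentwise on the seven values of $\rho$, and all function names are hypothetical.

```python
from fractions import Fraction

# Table 2: rho(A_i, pi) for pi = .25, .35, .40, .60, .65, .75, 1.
RHO = {
    "A1": (30, 10, 10, 10, 10, 10, 10),
    "A2": (20, 20, 20, 20, 20, 20, 0),
    "A3": (20, 20, 20, 15, 15, 0, 0),
    "A4": (20, 20, 20, 15, 15, 15, 0),
    "A5": (20, 20, 20, 20, 20, 10, 10),
    "A6": (20, 20, 20, 20, 20, 20, 10),
}
# Assumed numeric encoding of the classes: 1 = bad, 2 = medium, 3 = good.
CLASS = {"A1": 3, "A2": 2, "A3": 2, "A4": 1, "A5": 2, "A6": 3}


def dominates(a, b):
    """True if act a is at least as good as act b on every probability level."""
    return all(u >= v for u, v in zip(RHO[a], RHO[b]))


def dominating(x):
    """Acts that dominate x (the dominating set of x)."""
    return {y for y in RHO if dominates(y, x)}


def dominated(x):
    """Acts dominated by x (the dominated set of x)."""
    return {y for y in RHO if dominates(x, y)}


def upward(t):
    return {a for a in RHO if CLASS[a] >= t}


def downward(t):
    return {a for a in RHO if CLASS[a] <= t}


def lower(union, cone):
    """Acts whose whole dominance cone stays inside the union."""
    return {x for x in RHO if cone(x) <= union}


def upper(union, cone):
    """Union of the dominance cones of all acts in the union."""
    return set().union(*(cone(x) for x in union))


# Acts falling into some boundary region, and the resulting quality of sorting.
boundary = set()
for t in (2, 3):
    boundary |= upper(upward(t), dominating) - lower(upward(t), dominating)
quality = Fraction(len(RHO) - len(boundary), len(RHO))
```

With these definitions, `lower(upward(2), dominating)` leaves out `A3` because `A4`, a bad act, belongs to its dominating set, and `quality` evaluates to 4/6 in lowest terms.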

The second discovery was one reduct of condition attributes (criteria) ensuring 
the same quality of sorting as the whole set $\Pi$ of probabilities: $RED_{Cl}=\{.25, .75, 1\}$. 
This means that we can explain the preferences of the decision maker using only the 
probabilities in $RED_{Cl}$. $RED_{Cl}$ is also the core because no probability value in 
$RED_{Cl}$ can be removed without deteriorating the quality of sorting. 

The third discovery was a set of minimal decision rules describing the decision 
maker’s preferences [within square brackets there is a verbal interpretation of the 
corresponding decision rule] (within parentheses there are acts supporting the 
corresponding rule): 

1) if $\rho(A_i,.25)\geq 30$, then $A_i\in Cl_3^{\geq}$, 

[if the probability of gaining at least 30 is at least .25, then act $A_i$ is (at least) 
good] ($A_1$), 

2) if $\rho(A_i,.75)\geq 20$ and $\rho(A_i,1)\geq 10$, then $A_i\in Cl_3^{\geq}$, 

[if the probability of gaining at least 20 is at least .75 and the probability of 
gaining at least 10 is (at least) 1 (i.e. for sure the gain is at least 10), then act $A_i$ 
is (at least) good] ($A_6$), 

3) if $\rho(A_i,1)\geq 10$, then $A_i\in Cl_2^{\geq}$, 

[if the probability of gaining at least 10 is (at least) 1 (i.e. for sure the gain is 
at least 10), then act $A_i$ is at least medium] ($A_1$, $A_5$, $A_6$), 

4) if $\rho(A_i,.75)\geq 20$, then $A_i\in Cl_2^{\geq}$, 

[if the probability of gaining at least 20 is at least .75, then act $A_i$ is at least 
medium] ($A_2$, $A_6$), 

5) if $\rho(A_i,.25)\leq 20$ (i.e. $\rho'(A_i,1)\leq 20$) and $\rho(A_i,.75)\leq 15$ (i.e. $\rho'(A_i,.35)\leq 15$), then 
$A_i\in Cl_2^{\leq}$, 

[if the probability of gaining at most 20 is (at least) 1 (i.e. for sure you gain at 
most 20) and the probability of gaining at most 15 is at least .35, then act $A_i$ is at most 
medium] ($A_3$, $A_4$, $A_5$), 

6) if $\rho(A_i,1)\leq 0$ (i.e. $\rho'(A_i,.25)\leq 0$), then $A_i\in Cl_2^{\leq}$, 

[if the probability of gaining at most 0 is at least .25, then act $A_i$ is at most 
medium] ($A_2$, $A_3$, $A_4$), 

7) if $\rho(A_i,1)\geq 0$ and $\rho(A_i,1)\leq 0$ (i.e. $\rho(A_i,1)=0$) and $\rho(A_i,.75)\leq 15$ (i.e. 
$\rho'(A_i,.35)\leq 15$), then $A_i\in Cl_1\cup Cl_2$, 

[if the probability of gaining at least 0 is 1 (i.e. for sure the outcome is at least 0) 
and the probability of gaining at most 15 is at least .35, then act $A_i$ is bad or 
medium, without enough information to assign $A_i$ to only one of the two classes] 
($A_3$, $A_4$). 
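The acts listed in parentheses after each rule can be checked against Table 2. The following sketch encodes the condition part of each of the seven rules as (probability, relation, threshold) triples and lists the acts that satisfy it; the encoding and the names `RULES`, `holds` and `covered` are assumptions made for illustration only.

```python
# Rows of Table 2 (probability levels) and its columns (values of rho per act).
PROBS = (.25, .35, .40, .60, .65, .75, 1)
RHO = {
    "A1": (30, 10, 10, 10, 10, 10, 10),
    "A2": (20, 20, 20, 20, 20, 20, 0),
    "A3": (20, 20, 20, 15, 15, 0, 0),
    "A4": (20, 20, 20, 15, 15, 15, 0),
    "A5": (20, 20, 20, 20, 20, 10, 10),
    "A6": (20, 20, 20, 20, 20, 20, 10),
}

# Condition parts of rules 1-7: (probability level, relation, threshold on rho).
RULES = {
    1: [(.25, ">=", 30)],
    2: [(.75, ">=", 20), (1, ">=", 10)],
    3: [(1, ">=", 10)],
    4: [(.75, ">=", 20)],
    5: [(.25, "<=", 20), (.75, "<=", 15)],
    6: [(1, "<=", 0)],
    7: [(1, ">=", 0), (1, "<=", 0), (.75, "<=", 15)],
}


def holds(act, condition):
    """Check one elementary condition on the value of rho for the given act."""
    pi, relation, threshold = condition
    value = RHO[act][PROBS.index(pi)]
    return value >= threshold if relation == ">=" else value <= threshold


def covered(rule):
    """Acts satisfying every elementary condition of the rule."""
    return sorted(a for a in RHO if all(holds(a, c) for c in RULES[rule]))


# Number of elementary conditions used by the whole rule set.
n_conditions = sum(len(conditions) for conditions in RULES.values())

if __name__ == "__main__":
    for rule in RULES:
        print(rule, covered(rule))
    print("elementary conditions:", n_conditions)
```

Only the condition parts are encoded here; the conclusions (the unions of classes) are taken from the rule list as given.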

Minimal sets of minimal decision rules represent the most concise and non-redundant 
knowledge contained in Table 1 (and, consequently, in Table 2). The above minimal 
set of 7 decision rules uses 3 attributes (probability .25, .75 and 1) and 11 elementary 
conditions, i.e. 26% of descriptors from the original data table (Table 2). Of course 
this is only a didactic example: representations in terms of decision rules of larger sets 
of exemplary acts from real applications are more synthetic in the sense of the 
percentage of used descriptors from the original data table. 

5. Conclusions 

We introduced the rough sets theory of decisions under risk using the idea of 
stochastic dominance. The results are quite encouraging. Let us observe that we 
considered an additive probability distribution, but an extension to non-additive 
probability, and even to a qualitative ordinal probability, is straightforward. 
Furthermore, in case the elements of set $\Pi$ are numerous (as in real-case 
applications), a subset $\Pi'\subset\Pi$ of the most significant probability values (e.g. 
0, .1, .2, ..., .9, 1) can be considered. 

Abstract. Consideration of preference-orders requires the use of an extended 
rough set model called Dominance-based Rough Set Approach (DRSA). The 
rough approximations defined within DRSA are based on consistency in the 
sense of dominance principle. It requires that objects having not-worse 
evaluation with respect to a set of considered criteria than a referent object 
cannot be assigned to a worse class than the referent object. However, some 
inconsistencies may decrease the cardinality of lower approximations to such an 
extent that it is impossible to discover strong patterns in the data, particularly 
when data sets are large. Thus, a relaxation of the strict dominance principle is 
worthwhile. The relaxation introduced in this paper to the DRSA model admits 
some inconsistent objects to the lower approximations; the range of this 
relaxation is controlled by an index called consistency level. The resulting 
model is called variable-consistency model (VC-DRSA). We concentrate on the 
new definitions of rough approximations and their properties, and we propose a 
new syntax of decision rules characterized by a confidence degree not less than 
the consistency level. The use of VC-DRSA is illustrated by an example of 
customer satisfaction analysis referring to an airline company. 



1. Introduction 

Rough sets theory introduced by Pawlak [6] is an approach for analysing information 
about objects described by attributes. It is particularly useful for dealing with 
inconsistencies of input information caused by its granularity. The original rough set 
approach does not consider, however, the attributes with preference-ordered domains, 
i.e. criteria. Nevertheless, in many real-life problems the ordering properties of the 
considered attributes play an important role. For instance, such features of objects as 
product quality, market share and debt ratio are typically treated as criteria in economic 
problems. Motivated by this observation, Greco, Matarazzo and Slowinski [1,3] 
proposed a generalisation of the rough set approach to problems where ordering 
properties should be taken into account. Similarly to the original rough sets, this 
approach is based on approximations of partitions of the objects into pre-defined 
categories; however, differently from the original model, the categories are ordered from 
the best to the worst and the approximations are constructed using a dominance 
relation instead of an indiscernibility relation. The considered dominance relation is 
built on the basis of the information supplied by criteria. The new Dominance-based 
Rough Set Approach (DRSA) was applied to solve typical problems of Multiple-Criteria 
Decision Aiding (MCDA), i.e. choice, ranking and sorting (see e.g. [1,3]). 

In this paper, we consider a variant of DRSA applied to multiple-criteria sorting 
problems, which concern an assignment of objects evaluated by a set of criteria to 
some pre-defined and preference-ordered decision classes. In this variant of DRSA, 
the sets to be approximated with the dominance relation are the so-called upward and 
downward unions of decision classes. Encouraging results of its applications are 
known, e.g. in the evaluation of bankruptcy risk [2]. 

The analysis of large real-life data tables shows, however, that for some multiple- 
criteria sorting problems the application of DRSA identifies large differences between 
lower and upper approximations of the unions of decision classes and, moreover, 
rather weak decision rules, i.e. supported by few objects from lower approximations. 
The reason is that inconsistency, in the sense of dominance principle, between objects 
x and y assigned to very distant classes, h and t, respectively, (x dominates y, while 
class h is worse than t) causes inconsistency (ambiguity) also with all objects 
belonging to intermediate classes (from h to t) and dominated by x. In such cases it 
seems reasonable to relax the conditions for assignment of objects to lower 
approximations of the unions of decision classes. Classically, only non-ambiguous 
objects can be included in lower approximations. The relaxation will admit some 
ambiguous objects as well; the range of this ambiguity will be controlled by an index 
called consistency level. The aim of this article is to present a generalization of DRSA 
to variable consistency model (VC-DRSA). 

This kind of relaxation has already been considered within the classical 
indiscernibility-based rough set approach, by means of the so-called variable precision 
rough set model (VPRS) [11]. VPRS allows defining lower approximations accepting 
a limited number of counterexamples controlled by a pre-defined level of certainty. 

The paper is organized as follows. In section 2, main concepts of VC-DRSA are 
introduced, including rough approximations, approximation measures and decision 
rules. An illustrative example presented in section 3 refers to a real problem of 
customer satisfaction analysis in an airline company. The final section groups 
conclusions. 



2. Variable Consistency Dominance-Based Rough Set Approach 
(VC-DRSA) 

For algorithmic reasons, information about objects is represented in the form of an 
information table. The rows of the table are labelled by objects, whereas columns are 
labelled by attributes and entries of the table are attribute-values. Formally, by an 
information table we understand the 4-tuple $S=\langle U,Q,V,f\rangle$, where $U$ is a finite set of 
objects, $Q$ is a finite set of attributes, $V=\bigcup_{q\in Q}V_q$ and $V_q$ is a domain of the attribute 
$q$, and $f: U\times Q\to V$ is a total function such that $f(x,q)\in V_q$ for every $q\in Q$, $x\in U$, called an 
information function [6]. The set $Q$ is, in general, divided into set $C$ of condition 
attributes and set $D$ of decision attributes. 

Assuming that all condition attributes $q\in C$ are criteria, let $\succeq_q$ be a weak 
preference relation on $U$ with respect to criterion $q$ such that $x\succeq_q y$ means “$x$ is at 
least as good as $y$ with respect to criterion $q$”. We suppose that $\succeq_q$ is a total preorder, 
i.e. a strongly complete and transitive binary relation, defined on $U$ on the basis of 
evaluations $f(\cdot,q)$. 

Furthermore, assuming that the set of decision attributes $D$ (possibly a singleton 
$\{d\}$) makes a partition of $U$ into a finite number of decision classes, let $Cl=\{Cl_t, t\in T\}$, 
$T=\{1,...,n\}$, be a set of these classes such that each $x\in U$ belongs to one and only one 
class $Cl_t\in Cl$. We suppose that the classes are preference-ordered, i.e. for all $r,s\in T$, 
such that $r>s$, the objects from $Cl_r$ are preferred to the objects from $Cl_s$. The above 
assumptions are typical for consideration of a multiple-criteria sorting problem. 

The sets to be approximated are called upward union and downward union of 
classes, respectively: 

$Cl_t^{\geq}=\bigcup_{s\geq t}Cl_s$, $Cl_t^{\leq}=\bigcup_{s\leq t}Cl_s$, $t=1,...,n$. 

The statement $x\in Cl_t^{\geq}$ means “$x$ belongs at least to class $Cl_t$”, while $x\in Cl_t^{\leq}$ 
means “$x$ belongs at most to class $Cl_t$”. 

Let us remark that $Cl_1^{\geq}=Cl_n^{\leq}=U$, $Cl_n^{\geq}=Cl_n$ and $Cl_1^{\leq}=Cl_1$. Furthermore, for 
$t=2,...,n$, we have: 

$Cl_{t-1}^{\leq}=U-Cl_t^{\geq}$ and $Cl_t^{\geq}=U-Cl_{t-1}^{\leq}$. 

The key idea of rough sets is approximation of one knowledge by another 
knowledge. In classical rough set approach (CRSA), the knowledge approximated is a 
partition of U into classes generated by a set of decision attributes; the knowledge 
used for approximation is a partition of U into elementary sets of objects that are 
indiscernible by a set of condition attributes. The elementary sets are seen as 
“granules of knowledge” used for approximation. 

In DRSA, where condition attributes are criteria and classes are 
preference-ordered, the knowledge approximated is a collection of upward and 
downward unions of classes and the “granules of knowledge” are sets of objects 
defined using a dominance relation instead of an indiscernibility relation. This is the 
main difference between CRSA and DRSA. Let us now define the dominance 
relation. 

We say that $x$ dominates $y$ with respect to $P\subseteq C$, denoted by $xD_Py$, if $x\succeq_q y$ for all 
$q\in P$. Given $P\subseteq C$ and $x\in U$, the “granules of knowledge” used for approximation in 
DRSA are: 

- a set of objects dominating $x$, called P-dominating set, $D_P^+(x)=\{y\in U: yD_Px\}$, 

- a set of objects dominated by $x$, called P-dominated set, $D_P^-(x)=\{y\in U: xD_Py\}$. 

For any $P\subseteq C$ we say that $x\in U$ belongs to $Cl_t^{\geq}$ with no ambiguity at consistency 
level $l\in(0,1]$, if $x\in Cl_t^{\geq}$ and at least $l$*100% of all objects $y\in U$ dominating $x$ with 
respect to $P$ also belong to $Cl_t^{\geq}$, i.e. 

$\frac{card(D_P^+(x)\cap Cl_t^{\geq})}{card(D_P^+(x))}\geq l$. 

The level $l$ is called consistency level because it controls the degree of consistency 
between objects qualified as belonging to $Cl_t^{\geq}$ without any ambiguity. In other 
words, if $l<1$, then at most $(1-l)$*100% of all objects $y\in U$ dominating $x$ with respect to $P$ do 
not belong to $Cl_t^{\geq}$ and thus contradict the inclusion of $x$ in $Cl_t^{\geq}$. 

Analogously, for any $P\subseteq C$ we say that $x\in U$ belongs to $Cl_t^{\leq}$ with no ambiguity at 
consistency level $l\in(0,1]$, if $x\in Cl_t^{\leq}$ and at least $l$*100% of all the objects $y\in U$ 
dominated by $x$ with respect to $P$ also belong to $Cl_t^{\leq}$, i.e. 

$\frac{card(D_P^-(x)\cap Cl_t^{\leq})}{card(D_P^-(x))}\geq l$. 




Thus, for any $P\subseteq C$, each object $x\in U$ is either ambiguous or non-ambiguous at 
consistency level $l$ with respect to the upward union $Cl_t^{\geq}$ ($t=2,...,n$) or with respect to 
the downward union $Cl_t^{\leq}$ ($t=1,...,n-1$). 

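The concept of non-ambiguous objects at some consistency level $l$ leads naturally 
to the definition of P-lower approximations of the unions of classes $Cl_t^{\geq}$ and $Cl_t^{\leq}$: 

$\underline{P}^l(Cl_t^{\geq})=\{x\in Cl_t^{\geq}: \frac{card(D_P^+(x)\cap Cl_t^{\geq})}{card(D_P^+(x))}\geq l\}$, 

$\underline{P}^l(Cl_t^{\leq})=\{x\in Cl_t^{\leq}: \frac{card(D_P^-(x)\cap Cl_t^{\leq})}{card(D_P^-(x))}\geq l\}$. 
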
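Given $P\subseteq C$ and consistency level $l$, we can define the P-upper approximations of 
$Cl_t^{\geq}$ and $Cl_t^{\leq}$, denoted by $\overline{P}^l(Cl_t^{\geq})$ and $\overline{P}^l(Cl_t^{\leq})$, by complementation of 
$\underline{P}^l(Cl_{t-1}^{\leq})$ and $\underline{P}^l(Cl_{t+1}^{\geq})$ with respect to $U$: 

$\overline{P}^l(Cl_t^{\geq})=U-\underline{P}^l(Cl_{t-1}^{\leq})$, $\overline{P}^l(Cl_t^{\leq})=U-\underline{P}^l(Cl_{t+1}^{\geq})$. 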

$\overline{P}^l(Cl_t^{\geq})$ can be interpreted as the set of all the objects belonging to $Cl_t^{\geq}$, possibly 
ambiguous at consistency level $l$. Analogously, $\overline{P}^l(Cl_t^{\leq})$ can be interpreted as the set 
of all the objects belonging to $Cl_t^{\leq}$, possibly ambiguous at consistency level $l$. The 
P-boundaries (P-doubtful regions) of $Cl_t^{\geq}$ and $Cl_t^{\leq}$ are defined as: 

$Bn_P(Cl_t^{\geq})=\overline{P}^l(Cl_t^{\geq})-\underline{P}^l(Cl_t^{\geq})$, $Bn_P(Cl_t^{\leq})=\overline{P}^l(Cl_t^{\leq})-\underline{P}^l(Cl_t^{\leq})$, for $t=1,...,n$. 
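The effect of the consistency level can be seen on a very small table. The sketch below is only an illustration: the five objects, their evaluations on two gain-type criteria and their classes are hypothetical, and the function names are assumptions. Lower approximations follow the ratio condition stated above, and the upper approximation of an upward union is obtained by complementation of the lower approximation of the neighbouring downward union.

```python
from fractions import Fraction

# Hypothetical data: evaluations on two gain-type criteria and class labels
# (a larger class index means a better class). Object c dominates d but is
# assigned to a worse class, which makes the table inconsistent.
EVAL = {"a": (3, 3), "b": (2, 3), "c": (2, 2), "d": (1, 2), "e": (1, 1)}
CLASS = {"a": 2, "b": 2, "c": 1, "d": 2, "e": 1}


def dominates(x, y):
    return all(u >= v for u, v in zip(EVAL[x], EVAL[y]))


def d_plus(x):
    """Objects dominating x."""
    return {y for y in EVAL if dominates(y, x)}


def d_minus(x):
    """Objects dominated by x."""
    return {y for y in EVAL if dominates(x, y)}


def upward(t):
    return {x for x in EVAL if CLASS[x] >= t}


def downward(t):
    return {x for x in EVAL if CLASS[x] <= t}


def lower_up(t, level):
    """Lower approximation of the upward union at the given consistency level."""
    union = upward(t)
    return {x for x in union
            if Fraction(len(d_plus(x) & union), len(d_plus(x))) >= level}


def lower_down(t, level):
    """Lower approximation of the downward union at the given consistency level."""
    union = downward(t)
    return {x for x in union
            if Fraction(len(d_minus(x) & union), len(d_minus(x))) >= level}


def upper_up(t, level):
    """Upper approximation of the upward union, by complementation."""
    return set(EVAL) - lower_down(t - 1, level)
```

At level 1 object `d` is excluded from `lower_up(2, 1)` because one of the four objects dominating it belongs to a worse class; lowering the level to 3/4 admits it, since three of those four objects belong to the upward union.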

The variable consistency model of the dominance-based rough set approach provides 
some degree of flexibility in assigning objects to lower and upper approximations of 
the unions of decision classes. It can easily be demonstrated that for $0<l'<l\leq 1$ and 
$t=2,...,n$, 

$\underline{P}^{l}(Cl_t^{\geq})\subseteq\underline{P}^{l'}(Cl_t^{\geq})$ and $\overline{P}^{l'}(Cl_t^{\geq})\subseteq\overline{P}^{l}(Cl_t^{\geq})$. 

The variable consistency model is inspired by Ziarko’s model of the variable 
precision rough set approach [11,12]; however, there is a significant difference in the 
definition of rough approximations because $\underline{P}^l(Cl_t^{\geq})$ and $\overline{P}^l(Cl_t^{\geq})$ are composed of 
non-ambiguous and ambiguous objects at consistency level $l$, respectively, while 
Ziarko’s $\underline{P}^l(Cl_t)$ and $\overline{P}^l(Cl_t)$ are composed of P-indiscernibility sets such that at 
least $l$*100% of these sets are included in $Cl_t$ or have a non-empty intersection with 
$Cl_t$, respectively. If one would like to use Ziarko’s definition of variable precision 
rough approximations in the context of multiple-criteria sorting, then the 
P-indiscernibility sets should be substituted by P-dominating sets $D_P^+(x)$; however, then 
the notion of ambiguity that naturally leads to the general definition of rough 
approximations (see [9]) loses its meaning. Moreover, a bad side effect of a direct use 
of Ziarko’s definition is that a lower approximation $\underline{P}^l(Cl_t^{\geq})$ may include objects $y$ 
assigned to $Cl_h$, where $h$ is much less than $t$, if $y$ belongs to $D_P^+(x)$ that was included 
in $\underline{P}^l(Cl_t^{\geq})$. When the decision classes are preference-ordered, it is reasonable to 
expect that objects assigned to far worse classes than the considered union are not 
counted to the lower approximation of this union. 

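Furthermore, the following properties can be proved: 

1) $\overline{P}^l(Cl_t^{\geq})=Cl_t^{\geq}\cup\{x\in Cl_{t-1}^{\leq}: \frac{card(D_P^-(x)\cap Cl_{t-1}^{\leq})}{card(D_P^-(x))}<l\}=Cl_t^{\geq}\cup\{x\in Cl_{t-1}^{\leq}: \frac{card(D_P^-(x)\cap Cl_t^{\geq})}{card(D_P^-(x))}>1-l\}$, 

$\overline{P}^l(Cl_t^{\leq})=Cl_t^{\leq}\cup\{x\in Cl_{t+1}^{\geq}: \frac{card(D_P^+(x)\cap Cl_{t+1}^{\geq})}{card(D_P^+(x))}<l\}=Cl_t^{\leq}\cup\{x\in Cl_{t+1}^{\geq}: \frac{card(D_P^+(x)\cap Cl_t^{\leq})}{card(D_P^+(x))}>1-l\}$, 

2) $\underline{P}^l(Cl_t^{\geq})\subseteq Cl_t^{\geq}\subseteq\overline{P}^l(Cl_t^{\geq})$, $\underline{P}^l(Cl_t^{\leq})\subseteq Cl_t^{\leq}\subseteq\overline{P}^l(Cl_t^{\leq})$. 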

Due to complementarity of the rough approximations [3], the following property 
also holds: 

$Bn_P(Cl_t^{\geq})=Bn_P(Cl_{t-1}^{\leq})$, for $t=2,...,n$, and $Bn_P(Cl_t^{\leq})=Bn_P(Cl_{t+1}^{\geq})$, for $t=1,...,n-1$. 



For every $t\in T$ and for every $P\subseteq C$ we define the quality of approximation of partition 
$Cl$ by set of criteria $P$, or in short, quality of sorting: 

$\gamma_P(Cl)=\frac{card(U-\bigcup_{t\in T}Bn_P(Cl_t^{\leq}))}{card(U)}=\frac{card(U-\bigcup_{t\in T}Bn_P(Cl_t^{\geq}))}{card(U)}$. 

The quality expresses the ratio of all P-correctly sorted objects to all objects in the 
table. 

Each minimal subset $P\subseteq C$ such that $\gamma_P(Cl)=\gamma_C(Cl)$ is called a reduct of $Cl$ and 
denoted by $RED_{Cl}$. Let us remark that an information table can have more than one 
reduct. The intersection of all reducts is called the core and denoted by $CORE_{Cl}$. 

Let us recall that the dominance-based rough approximations of upward and
downward unions of classes can serve to induce a generalized description of objects
contained in the information table in terms of "if..., then..." decision rules. For a
given upward or downward union of classes, $Cl_t^\geq$ or $Cl_s^\leq$, the decision rules
induced under a hypothesis that objects belonging to $\underline{P}(Cl_t^\geq)$ (or
$\underline{P}(Cl_s^\leq)$) are positive and all the others negative, suggest an
assignment to "at least class $Cl_t$" (or to "at most class $Cl_s$"). They are called
$D_\geq$- (or $D_\leq$-) certain decision rules because they assign objects to classes
without any ambiguity. Next, if upper approximations differ from lower approximations,
another kind of decision rules can be induced under the hypothesis that objects
belonging to $\overline{P}(Cl_t^\geq)$ (or to $\overline{P}(Cl_s^\leq)$) are positive and
all the others negative. These rules are called $D_\geq$- (or $D_\leq$-) possible
decision rules, suggesting that an object could belong to "at least class $Cl_t$" (or
"at most class $Cl_s$"). Yet another option is to induce $D_{\geq\leq}$-approximate
decision rules from the intersection $\overline{P}(Cl_s^\leq) \cap \overline{P}(Cl_t^\geq)$
instead of possible rules. For more discussion see [8].

Within VC-DRSA, decision rules are induced from examples belonging to
extended approximations. So, it is necessary to assign to each decision rule an
additional parameter $\alpha$, called confidence of the rule. It controls the
discrimination ability of the rule.

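Assuming that for each $q \in C$, $V_q \subseteq \mathbb{R}$ (i.e. $V_q$ is quantitative)
and that for each $x, y \in U$, $f(x,q) \geq f(y,q)$ implies that $x$ is at least as good
as $y$ with respect to $q$ (i.e. $V_q$ is preference-ordered), the following two basic
types of variable-consistency decision rules can be considered:

1) $D_\geq$-decision rules with the following syntax:

   if $f(x,q_1) \geq r_{q_1}$ and $f(x,q_2) \geq r_{q_2}$ and ... $f(x,q_p) \geq r_{q_p}$, then $x \in Cl_t^\geq$ with confidence $\alpha$,

   where $P = \{q_1, ..., q_p\} \subseteq C$, $(r_{q_1}, ..., r_{q_p}) \in V_{q_1} \times V_{q_2} \times ... \times V_{q_p}$ and $t \in T$;

2) $D_\leq$-decision rules with the following syntax:

   if $f(x,q_1) \leq r_{q_1}$ and $f(x,q_2) \leq r_{q_2}$ and ... $f(x,q_p) \leq r_{q_p}$, then $x \in Cl_t^\leq$ with confidence $\alpha$,

   where $P = \{q_1, ..., q_p\} \subseteq C$, $(r_{q_1}, ..., r_{q_p}) \in V_{q_1} \times V_{q_2} \times ... \times V_{q_p}$ and $t \in T$.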

We say that an object supports a decision rule if it matches both condition and
decision parts of the rule. On the other hand, an object is covered by a decision rule if
it matches the condition part of the rule. More formally, given a $D_\geq$-rule $\rho$:
if $f(x,q_1) \geq r_{q_1}$ and $f(x,q_2) \geq r_{q_2}$ and ... $f(x,q_p) \geq r_{q_p}$, then
$x \in Cl_t^\geq$, an object $y \in U$ supports decision rule $\rho$ iff
$f(y,q_1) \geq r_{q_1}$ and $f(y,q_2) \geq r_{q_2}$ and ... $f(y,q_p) \geq r_{q_p}$ and
$y \in Cl_t^\geq$, while $y$ is covered by $\rho$ iff $f(y,q_1) \geq r_{q_1}$ and
$f(y,q_2) \geq r_{q_2}$ and ... $f(y,q_p) \geq r_{q_p}$. Similar definitions hold
for $D_\leq$-decision rules.

Let $Cover(\rho)$ denote the set of all objects covered by the rule $\rho$. Thus, the
confidence $\alpha$ of $D_\geq$-decision rule $\rho$ is defined as:

$\alpha = \frac{card(Cover(\rho) \cap Cl_t^\geq)}{card(Cover(\rho))}$.

For $D_\leq$-decision rule the confidence is defined in a similar way.
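The notions of cover, support and confidence can be illustrated with a short Python sketch. The six-object table and the function names below are hypothetical and serve only to show how the ratio is formed for an "at least" rule.

```python
from fractions import Fraction

# Hypothetical data: object id -> ((q1, q2, q3), overall class d).
DATA = {
    1: ((2, 3, 3), 2),
    2: ((1, 2, 3), 2),
    3: ((3, 2, 3), 1),
    4: ((2, 2, 2), 2),
    5: ((1, 1, 3), 1),
    6: ((3, 3, 3), 3),
}


def cover(conditions):
    """Objects matching every elementary condition f(x, q) >= r.

    `conditions` maps a 0-based criterion index to its threshold r.
    """
    return {
        x
        for x, (evals, _) in DATA.items()
        if all(evals[q] >= r for q, r in conditions.items())
    }


def support(conditions, t):
    """Covered objects that also belong to the upward union of class t."""
    return {x for x in cover(conditions) if DATA[x][1] >= t}


def confidence(conditions, t):
    """card(Cover ∩ Cl_t^>=) / card(Cover); assumes the rule covers something."""
    covered = cover(conditions)
    return Fraction(len(support(conditions, t)), len(covered))


# Rule: if f(x,q2) >= 2 and f(x,q3) >= 3, then x belongs to "at least class 2".
rule = {1: 2, 2: 3}
print(sorted(cover(rule)), confidence(rule, 2))
```

Here the rule covers objects 1, 2, 3 and 6, of which only object 3 falls outside the union, so the confidence is 3/4. With a consistency level of 0.8 such a rule would be rejected; with 0.75 it would be accepted.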

Let us remark that the decision rules are induced from P-lower approximations
whose composition is controlled by user-specified consistency level $l$. In
consequence, the value of confidence $\alpha$ for the rule should be constrained from the
bottom. It seems reasonable to require that the smallest accepted confidence of the
rule should not be lower than the currently used consistency level $l$. Indeed, in the
worst case, some objects from the P-lower approximation may create a rule using all
criteria from P thus giving a confidence $\alpha \geq l$. The user may have a possibility
of increasing this lower bound for confidence of the rule but then decision rules may not
cover all objects from the approximations.

Moreover, we require that each decision rule is minimal. Since a decision rule is an
implication, by a minimal decision rule we understand such an implication that there
is no other implication with an antecedent of at least the same weakness (in other
words, a rule using a subset of elementary conditions and/or weaker elementary
conditions) and a consequent of at least the same strength (in other words, a rule
assigning objects to the same union or sub-union of classes) with a not worse
confidence $\alpha \geq l$.



Consider a $D_\geq$-decision rule "if $f(x,q_1) \geq r_{q_1}$ and $f(x,q_2) \geq r_{q_2}$
and ... $f(x,q_p) \geq r_{q_p}$, then $x \in Cl_t^\geq$" with confidence $\alpha$. If there
exists an object $y \in \underline{P}^l(Cl_t^\geq)$, $P = \{q_1, q_2, ..., q_p\}$ and
$l \leq \alpha$, such that $f(y,q_1) = r_{q_1}$ and $f(y,q_2) = r_{q_2}$ and ...
$f(y,q_p) = r_{q_p}$, then $y$ is called basis of the rule. Each $D_\geq$-decision rule
having a basis is called robust because it is "founded" on an object existing in the data
table. Analogous definition of robust decision rules holds for $D_\leq$-decision rules.

The induction of variable-consistency decision rules can be done using properly
modified algorithms proposed for DRSA. Let us recall that in DRSA, decision rules
should have confidence equal to 1. The key modification of rule induction algorithms
for VC-DRSA consists in accepting as rules such conjunctions of elementary
conditions that yield confidence $\alpha \geq l$. Let us also notice that different
strategies of rule induction could be used [10]. For instance, one can wish to induce a
minimal and complete set of rules covering all input examples, or all minimal rules, or a
subset of rules satisfying some user's pre-defined requirements, e.g. generality or
support. The details of one of the rule induction algorithms for VC-DRSA can be found
in [4].



3. Illustrative Example

Let us illustrate the above concepts on a didactic example. The example refers to a
real problem of customer satisfaction analysis [7] in an airline company. The
company has distributed a questionnaire to its customers in order to get their opinion
about the quality of its services. Among the questions of the questionnaire there are
three items concerning specific aspects of the aircraft comfort: space for hand luggage
($q_1$), seat width ($q_2$) and leg room ($q_3$). Moreover, there is also a question about
an overall evaluation of the aircraft comfort ($d$). A customer's answer to each of these
questions gives an evaluation on a three grade ordinal scale: poor, average, good.

The data table contains 50 objects (questionnaires) described by the set
$C = \{q_1, q_2, q_3\}$ of 3 criteria corresponding to the considered aspects of the
aircraft comfort and the overall evaluation $D = \{d\}$. All criteria are to be maximized.
The scale of criteria is number-coded: 1=poor, 2=average, 3=good. The overall evaluation
$d$ creates three decision classes, which are preference-ordered according to increasing
class number, i.e. $Cl_1$=poor, $Cl_2$=average, $Cl_3$=good. The analysed data are
presented in Table 1.



Table 1. Customer satisfaction data table

The marketing department of the airline company wants to analyse the influence of
the three specific aspects on the overall evaluation of the aircraft comfort. Thus, a
sample of questionnaires was analysed using VC-DRSA. As the decision classes are
ordered, the following downward and upward unions of classes are to be considered:

at most poor: $Cl_1^\leq$ = {2,3,5,10,13,15,20,24,25,27,31,33,37,39,40,41,42,44,45,46,47},

at most average: $Cl_2^\leq$ = {1,2,3,5,6,7,8,10,11,12,13,15,16,17,18,20,22,23,24,25,26,27,
28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47},

at least average: $Cl_2^\geq$ = {1,4,6,7,8,9,11,12,14,16,17,18,19,21,22,23,26,28,29,30,32,
34,35,36,38,43,48,49,50},

at least good: $Cl_3^\geq$ = {4,9,14,19,21,48,49,50}.

Let us observe that in the data table there are several inconsistencies. For instance,
object #3 dominates object #6, because its evaluations on all criteria $q_1$, $q_2$, $q_3$
are not worse; however, it is assigned to the decision class $Cl_1$, worse than $Cl_2$ to
which object #6 belongs. This means that customer #3 gave an evaluation for all the
considered aspects not worse than the evaluation given by customer #6 and, on
the other hand, customer #3 gave an overall evaluation of the aircraft comfort worse
than the overall evaluation of customer #6. There are 99 inconsistent pairs in the data
table violating the dominance principle in this way.

The data table has been analysed by VC-DRSA assuming the confidence level
$l = 0.8$. In this case, the approximations of upward and downward unions of decision
classes are the following (the objects present in the lower approximations obtained for
confidence level $l = 1$ are in bold):

$\underline{C}^{0.8}(Cl_1^\leq)$ = {2,13,15,25,31,37,40,42},

$\overline{C}^{0.8}(Cl_1^\leq)$ = {2,3,5,6,7,8,10,12,13,15,17,19,20,24,25,26,27,30,31,33,34,35,36,37,38,
39,40,41,42,43,44,45,46,47},

$Bn_C^{0.8}(Cl_1^\leq)$ = {3,5,6,7,8,10,12,17,19,20,24,26,27,30,33,34,35,36,38,39,40,41,43,44,
45,46,47};

$\underline{C}^{0.8}(Cl_2^\leq)$ = {1,2,3,5,6,7,8,10,11,12,13,15,16,17,18,20,22,23,24,25,26,27,28,29,30,31,
32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47},

$\overline{C}^{0.8}(Cl_2^\leq)$ = {1,2,3,5,6,7,8,10,11,12,13,15,16,17,18,19,20,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49},

$Bn_C^{0.8}(Cl_2^\leq)$ = {19,48,49};

$\underline{C}^{0.8}(Cl_2^\geq)$ = {1,4,9,11,14,16,18,21,22,23,28,29,32,48,49,50},

$\overline{C}^{0.8}(Cl_2^\geq)$ = {1,3,4,5,6,7,8,9,10,11,12,14,16,17,18,19,20,21,22,23,24,26,27,28,29,30,
32,33,34,35,36,38,39,40,41,43,44,45,46,47,48,49,50},

$Bn_C^{0.8}(Cl_2^\geq)$ = {3,5,6,7,8,12,17,19,20,24,26,27,30,33,34,35,36,38,39,40,41,43,
44,45,46,47};

$\underline{C}^{0.8}(Cl_3^\geq)$ = {4,9,14,21,50},

$\overline{C}^{0.8}(Cl_3^\geq)$ = {4,9,14,19,21,48,49,50},

$Bn_C^{0.8}(Cl_3^\geq)$ = {19,48,49}.

The set of all robust decision rules having a confidence level $\alpha \geq 0.8$ was
induced from the above approximations. Let us remark that rules having confidence
$\alpha = 1$ are the same as obtained with the DRSA rule induction algorithm. The induced
rules are listed below:

Rule 1. if (/(x,^i) <1) and ^1), then xe C/f [a=.83] 

Rule 2. //(/(x,^i) <1) and <1), then xe Cl^ [ot=l] 

Rule 3. if (fix,q{) <1) and (fix,q' 2 ) <1) and (fix,q^) <2), then xe C/f [a=l] 

Rule 4. if{{f{x,q\) <1), then xg Cd~ [a=0.89] 

Rule 5. if ((/(x,^i) <2), then xe C/~ [a=0.85] 

Rule 6. //'((/(x,^ 2 ) <2), then xg C/" [a=0.91] 

Rule 7. //((/(x,< 72 ) <1), then xg C7~ [oc=1] 

Rule 8. if (/(x,^i) <2) and ifx^q^) <2), then xg Cl~ [a=0.92] 

Rule 9. /f((/(x,< 73 ) <2), then xg Cd~ [a=0.92] 

Rule 10. if ((/(x,^ 3 ) <1), then xg C/“ [a=0.95] 

Rule 11. if{J{x,q^) <2) and{f{x,q^) <1), then xg C/- [a=l] 

Rule 12. /f(/(x,< 7 i) <2) and ifx.q^) <2), then xg C7“ [a=0.96] 

Rule 13. if (/(x,^ 2 ) ^2) and ifx^q^) <2), xg C/~ [a=0.93] 

Rule 14. fif{x,qfi>2) and {f{x,q^)>2), then xg C7 [a=0.82] 

Rule 15. // (/(x,^2)^3) and {fx,qfji>2), then xg C/ [a=l] 

Rule 16. if ($f(x,q_2) \geq 2$) and ($f(x,q_3) \geq 3$), then $x \in Cl_2^\geq$ [$\alpha$=0.9]

Rule 17. if (/(x,<^i)>3) and ij{x,qfi>2) and (/(x,<^3)>2), then xg C/ [a=0.83] 

Rule 18. if ij{x,q2)>2) and (J{x,q^)>2), then xg Cl [a=0.8] 

Rule 1 9. if (/(x,<7i)>3) and (J{x,q^>3) and {f{x,q'^)>2\ then xg Cl [ot=l ] 

Rule 20. if (/(x,^i)>2) and {fix,qfi>3) and {fx,q-^)>2\ then xg Cl [a=l] 

Managers of the airline company appreciated the easy verbal interpretation of the 
above rules. For example rule 16 says that if seat width is at least average and leg 
room is (at least) good, then the overall evaluation of the comfort is at least average 
with a confidence of 90%, independently of the space for hand luggage. This rule 
covers 10 examples, i.e. 20% of all questionnaires. Let us remark that such a strong 
pattern would not be discovered by DRSA because of one negative example (#45) 
being inconsistent with four other positive examples (#16,18,22,49). 

Let us remark that relaxation of the confidence level has at least the following two 
positive consequences: 

1) it enlarges lower approximations, permitting to regain many objects that were
inconsistent with some marginal objects from outside of the considered union of
classes: for instance six objects, #16,18,22,32,48,49, inconsistent with object
#45 entered the lower approximation $\underline{C}^{0.8}(Cl_2^\geq)$;

2) it discovers strong rule patterns that did not appear when the dominance was 
strictly observed in rule induction: for instance rule 16 above. 

The above aspects are very useful when dealing with large real data sets, which is 
usually the case of customer satisfaction analysis. 



4. Conclusions 

The relaxation of the dominance principle introduced in the dominance-based 
rough set approach results in a more flexible approach insensitive to marginal 
inconsistencies encountered in data sets. The variable-consistency model thus 
obtained maintains all basic properties of the rough sets theory, like inclusion, 
monotonicity with respect to supersets of criteria and with respect to the consistency 
level. The rough approximations resulting from this model are the basis for 
construction of decision rules with a required confidence. The variable-consistency 
model is particularly useful for analysis of large data sets where marginal 
inconsistencies may considerably reduce the lower approximations and prevent 
discovery of strong rule patterns. 

Acknowledgement 

The research of S. Greco and B. Matarazzo has been supported by the Italian 
Ministry of University and Scientific Research (MURST). R. Slowinski and J. 
Stefanowski wish to acknowledge financial support of the KBN research grant no. 
8T11F 006 19 from the State Committee for Scientific Research.



References 

1. Greco S., Matarazzo B., Slowinski R., Rough Approximation of Preference 
Relation by Dominance Relations, ICS Research Report 16/96, Warsaw 
University of Technology, Warsaw, 1996. Also in European Journal of 
Operational Research 117 (1999) 63-83.

2. Greco S., Matarazzo B., Slowinski R., A new rough set approach to evaluation of 
bankruptcy risk. In C. Zopounidis (eds.), Operational Tools in the Management of 
Financial Risk, Kluwer Academic Publishers, Dordrecht, Boston, 1998, 121-136. 

3. Greco S., Matarazzo B., Slowinski R., The use of rough sets and fuzzy sets in 
MCDM. In T. Gal, T. Stewart and T. Hanne (eds.) Advances in Multiple Criteria 
Decision Making, chapter 14, Kluwer Academic Publishers, Boston, 1999, 14.1- 
14.59. 





4. Greco S., Matarazzo B., Slowinski R., Stefanowski J., An algorithm for induction 
of decision rules consistent with dominance principle. In: Proc. 2nd Int.
Conference on Rough Sets and Current Trends in Computing, Banff, October 16- 
19, 2000 (to appear). 

5. Pawlak, Z., Rough sets, International Journal of Information & Computer
Sciences 11 (1982) 341-356. 

6. Pawlak, Z., Rough Sets. Theoretical Aspects of Reasoning about Data, Kluwer 
Academic Publishers, Dordrecht, 1991. 

7. Siskos Y., Grigoroudis E., Zopounidis C., Sauris O., Measuring customer 
satisfaction using a collective preference disaggregation model, Journal of Global 
Optimization, 12 (1998) 175-195. 

8. Slowinski R., Stefanowski J, Greco, S., Matarazzo, B., Rough sets processing of 
inconsistent information. Control and Cybernetics 29 (2000) no. 1, 379-404.

9. Slowinski R., Vanderpooten D., A generalized definition of rough approximations 
based on similarity. IEEE Transactions on Data and Knowledge Engineering, 12 
(2000) no. 2, 331-336. 

10. Stefanowski J., On rough set based approaches to induction of decision rules. In 
Polkowski L., Skowron A. (eds.) Rough Sets in Data Mining and Knowledge 
Discovery, vol. 1, Physica-Verlag, Heidelberg, 1998, 500-529. 

11. Ziarko W. Variable precision rough sets model. Journal of Computer and Systems
Sciences (1993) no. 1, 39-59. 

12. Ziarko W. Rough sets as a methodology for data mining. In Polkowski L.,
Skowron A. (eds.) Rough Sets in Data Mining and Knowledge Discovery, vol. 1, 
Physica-Verlag, Heidelberg, 1998, 554-576. 




Approximations and Rough Sets Based on Tolerances 



Jouni Järvinen



Turku Centre for Computer Science (TUCS)
Lemminkäisenkatu 14 A, FIN-20520 Turku, Finland

jjarvine@cs.utu.fi



Abstract. In rough set theory it is supposed that the knowledge about objects is
limited by an indiscernibility relation. Commonly indiscernibility relations are
assumed to be equivalences interpreted so that two objects are equivalent if we
cannot distinguish them by their properties. However, there are natural
indiscernibility relations which are not transitive, and here we assume that the
knowledge about objects is restricted by a tolerance relation R. We study approximations,
R-definable sets, R-equalities, and investigate briefly the structure of R-rough
sets.



1 Indiscernibility 

The rough set theory introduced by Z. Pawlak in the early eighties [15] deals with
incomplete information about objects. More precisely, it considers situations in which
objects may be indiscernible by their properties. A primary application area has been
data mining.

In rough set theory it is assumed that the knowledge about objects is restricted by 
an indiscernibility relation. Usually indiscernibility relations are supposed to be equiv- 
alences such that two objects are equivalent if we cannot distinguish them by our infor- 
mation. Since such an equivalence E induces a partition whose blocks are the equiva- 
lence classes, the objects of the given universe U can be divided into three classes with 
respect to any subset $X \subseteq U$:

1. the objects, which surely are in X; 

2. the objects, which are surely not in X; 

3. the objects, which possibly are in X. 

The objects in class 1 form the lower E-approximation of X, and the objects of
type 1 and 3 form together its upper E-approximation. The E-boundary of X consists
of objects in class 3. Subsets of U which are identical to both of their approximations
are called E-definable.

In this work we assume that the indiscernibility relations are tolerances. Namely,
transitivity is not an obvious property of indiscernibility relations. For example,
transitive relations fail to capture the indiscernibility involved in the concept of
heap as it appears in Eubulides' paradox: "What is the smallest number of grains making a
heap of grains?" (see [8], for example).

In Pawlak's information systems [14] indiscernibility relations which are
equivalences arise naturally when one considers a given set of attributes: two objects
are equivalent when their values of all attributes in the set are the same. However, some
of the natural indiscernibility relations encountered in nondeterministic information
systems are not necessarily transitive (see [13], for example).

Example 1. Suppose that the set $U = \{1, ..., 7\}$ consists of seven patients called
1, 2, ..., 7, respectively. Their body temperatures (temp), blood pressures (BP), and
hemoglobin values (Hb) are given in Table 1.

    Patient   temp   BP       Hb
    1         39.3   103/65   125
    2         39.1   97/60    116
    3         39.2   109/71   132
    4         37.1   150/96   139
    5         37.3   145/93   130
    6         37.8   143/95   121
    7         36.7   138/83   130

Table 1.



Let us define an indiscernibility relation R so that two patients are R-related if their
values for body temperature, blood pressure and hemoglobin are so close to each other
that the differences are insignificant. The tolerance R is represented graphically by the
following graph.




For example, patients 1 and 2 are R-related since their values for body temperature,
blood pressure, and hemoglobin are essentially the same.

We end this section by noting that most of the results issued in this work are pre- 
sented in the author’s doctoral dissertation [6] where all proofs and many further facts 
can be found. We also remark that all our lattice theoretical notions and results appear 
in [1], for example. 



2 Approximations 

First we study approximations determined by tolerance relations. B. Konikowska [7] 
and J. A. Pomykala [18] considered approximation operations defined by strong similar- 
ity relations of nondeterministic information systems. Also J. Nieminen [9] has studied 
approximations induced by tolerances but his definition is not the same as ours. Fur- 
thermore, J. A. Pomykala [17] and W. Zakowski [21] have investigated related notions 
of approximations defined by covers. In [19] A. Skowron and J. Stepaniuk introduced
tolerance approximation spaces and they presented in this context several results
concerning particularly attribute reduction. J. Järvinen considered in [5] dependence
spaces induced by preimage relations. In this setting attribute dependencies induced by
reflexive and symmetric indiscernibility relations can be studied.

If a binary relation R on a set U is reflexive and symmetric, it is called a tolerance
relation on U. The set of all tolerance relations on U is denoted by Tol(U). For all
$R \in Tol(U)$ and $a \in U$, the set $a/R = \{b \in U \mid aRb\}$ is the R-neighborhood
of a.

Definition 2. Let U be a set of objects and let R be a tolerance on U. The lower
R-approximation and the upper R-approximation of a set $X \subseteq U$ are defined by

$X_R = \{x \in U \mid x/R \subseteq X\}$,
$X^R = \{x \in U \mid x/R \cap X \neq \emptyset\}$,

respectively. The set $B_R(X) = X^R - X_R$ is called the R-boundary of X.

The set $X_R$ (resp. $X^R$) consists of elements which surely (resp. possibly) belong
to X in view of the knowledge provided by R. The R-boundary is the actual area of
uncertainty. It consists of elements whose membership in X cannot be decided when
R-related objects cannot be distinguished from each other.

Let us consider the tolerance R defined in Example 1. The upper R-approximation
of $X = \{4, 5, 6\}$ is $X^R = \{4, 5, 6, 7\}$ and its lower R-approximation is
$X_R = \{4, 6\}$. Hence, the R-boundary of X is $B_R(X) = \{5, 7\}$. For example, 5 is in
$B_R(X)$ since both in $X$ and in $X^c$ there is an object which is indiscernible from 5.
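Definition 2 translates directly into a few lines of Python. The graph of R is not available in this text, so the edge list below is an assumption: it is one tolerance on the seven patients that reproduces the approximations quoted above. The function names are likewise only illustrative.

```python
# Assumed edges of the tolerance R of Example 1 (the original figure is not
# available here); reflexive pairs are added automatically.
EDGES = [(1, 2), (1, 3), (4, 5), (5, 6), (5, 7)]
U = set(range(1, 8))


def neighborhoods(universe, edges):
    """Build the R-neighborhood a/R of every object from an undirected edge list."""
    nbh = {a: {a} for a in universe}  # reflexivity
    for a, b in edges:                # symmetry
        nbh[a].add(b)
        nbh[b].add(a)
    return nbh


NBH = neighborhoods(U, EDGES)


def lower(X):
    """Lower R-approximation: objects whose whole neighborhood lies in X."""
    return {x for x in U if NBH[x] <= X}


def upper(X):
    """Upper R-approximation: objects whose neighborhood meets X."""
    return {x for x in U if NBH[x] & X}


def boundary(X):
    return upper(X) - lower(X)


X = {4, 5, 6}
print(sorted(lower(X)), sorted(upper(X)), sorted(boundary(X)))
```

Under the assumed edges, patient 5 is related to patient 7, who lies outside X, which is why 5 drops out of the lower approximation and 7 enters the upper one.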

Our first proposition gives some basic properties of approximations. Note that the
set of all subsets of a set U is denoted by $\wp(U)$, and for any $X \subseteq U$,
$X^c = U - X$ is the complement of X.

Proposition 3. If $R \in Tol(U)$ and $X, Y \subseteq U$, then

(a) $\emptyset_R = \emptyset^R = \emptyset$ and $U_R = U^R = U$;

(b) $X_R \subseteq X \subseteq X^R$;

(c) $(X_R)^c = (X^c)^R$ and $(X^R)^c = (X^c)_R$;

(d) $B_R(X) = B_R(X^c)$;

(e) $X \subseteq Y$ implies $X^R \subseteq Y^R$ and $X_R \subseteq Y_R$.

Note that Proposition 3(d) means simply that if we cannot decide whether an object is
in $X$, we obviously cannot decide whether it is in $X^c$ either.

Definition 4. Let $\mathcal{P} = (P, \leq)$ be an ordered set. A pair
$(^\triangleright, ^\triangleleft)$ of maps $^\triangleright: P \to P$ and
$^\triangleleft: P \to P$ (which we refer to as the right map and the left map,
respectively) is called a dual Galois connection on $\mathcal{P}$ if $^\triangleright$ and
$^\triangleleft$ are order-preserving and
$p^{\triangleleft\triangleright} \leq p \leq p^{\triangleright\triangleleft}$ for all
$p \in P$.

It is now obvious that a dual Galois connection on $\mathcal{P} = (P, \leq)$ is a Galois
connection between $\mathcal{P}$ and $\mathcal{P}^\partial = (P, \geq)$. The following
proposition presents some basic properties of dual Galois connections, which follow
easily from the properties of Galois connections (see [1, 12], for example). But before
that we define closure and interior operators.

Let $\mathcal{P} = (P, \leq)$ be an ordered set. Then a map $c: P \to P$ is a closure
operator if for all $x, y \in P$, $x \leq c(x)$, $x \leq y$ implies $c(x) \leq c(y)$, and
$c(c(x)) = c(x)$. An element $x \in P$ is called closed if $c(x) = x$. The set of all
c-closed elements is denoted by $P_c$. It is well-known that
$P_c = \{c(x) \mid x \in P\}$. If $k: P \to P$ is a closure operator on
$\mathcal{P}^\partial = (P, \geq)$, it is an interior operator on $\mathcal{P}$.

Proposition 5. Let $(^\triangleright, ^\triangleleft)$ be a dual Galois connection on a
complete lattice $\mathcal{P}$.

(a) For all $p \in P$, $p^{\triangleright\triangleleft\triangleright} = p^\triangleright$
and $p^{\triangleleft\triangleright\triangleleft} = p^\triangleleft$.

(b) The map $c: P \to P$, $p \mapsto p^{\triangleright\triangleleft}$ is a closure operator
on $\mathcal{P}$ and the map $k: P \to P$, $p \mapsto p^{\triangleleft\triangleright}$ is an
interior operator on $\mathcal{P}$.

(c) If $c$ and $k$ are the mappings defined in (b), then restricted to the sets of
c-closed elements $P_c$ and k-closed elements $P_k$, respectively, $^\triangleright$ and
$^\triangleleft$ yield a pair $^\triangleright: P_c \to P_k$,
$^\triangleleft: P_k \to P_c$ of mutually inverse order-isomorphisms between the complete
lattices $(P_c, \leq)$ and $(P_k, \leq)$.

(d) If $A \subseteq P$, then
$(\bigvee A)^\triangleright = \bigvee \{p^\triangleright \mid p \in A\}$ and
$(\bigwedge A)^\triangleleft = \bigwedge \{p^\triangleleft \mid p \in A\}$.

(e) The kernel of the map $^\triangleright: P \to P$ is a congruence on $(P, \vee)$ such
that the greatest element in the congruence class of any $p \in P$ is
$p^{\triangleright\triangleleft}$.

(f) The kernel of the map $^\triangleleft: P \to P$ is a congruence on $(P, \wedge)$ such
that the least element in the congruence class of any $p \in P$ is
$p^{\triangleleft\triangleright}$.

We can now write the following proposition.

Proposition 6. If $R \in Tol(U)$, then $(^R, _R)$ is a dual Galois connection on
$(\wp(U), \subseteq)$.

Let $^\triangleright$ and $^\triangleleft$ be maps on $\wp(U)$. We say that
$(^\triangleright, ^\triangleleft)$ is an approximation pair, if there exists an
$R \in Tol(U)$ such that $X^\triangleright = X^R$ and $X^\triangleleft = X_R$ for all
$X \subseteq U$.

In [6] J. Järvinen characterized the dual Galois connections on $(\wp(U), \subseteq)$
which are approximation pairs.

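We end this section by presenting a corollary of Propositions 5 and 6.

Corollary 7. Let $R \in Tol(U)$, $X \subseteq U$, and $\mathcal{H} \subseteq \wp(U)$.

(a) $((X^R)_R)^R = X^R$ and $((X_R)^R)_R = X_R$;

(b) the map $X \mapsto (X^R)_R$ is a closure operator and the map $X \mapsto (X_R)^R$ is
an interior operator;

(c) $(\bigcup \mathcal{H})^R = \bigcup \{X^R \mid X \in \mathcal{H}\}$ and
$(\bigcap \mathcal{H})_R = \bigcap \{X_R \mid X \in \mathcal{H}\}$;

(d) $(\bigcap \mathcal{H})^R \subseteq \bigcap \{X^R \mid X \in \mathcal{H}\}$ and
$(\bigcup \mathcal{H})_R \supseteq \bigcup \{X_R \mid X \in \mathcal{H}\}$.

We note that if E is an equivalence relation, then $(X_E)_E = (X_E)^E = X_E$ and
$(X^E)^E = (X^E)_E = X^E$ (cf. Corollary 7(a)). Note also that the inclusions in Corollary
7(d) can be proper, and this holds for approximations defined by equivalences as well.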
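Corollary 7(a) and (b) can be checked exhaustively on a small universe. The sketch below uses a hypothetical five-element tolerance (two numbers are related when they differ by at most 1), chosen only for illustration; the helper names are assumptions as well.

```python
from itertools import combinations

U = frozenset(range(1, 6))
# Hypothetical tolerance: a is related to b when |a - b| <= 1.
NBH = {a: frozenset(b for b in U if abs(a - b) <= 1) for a in U}


def lower(X):
    return frozenset(x for x in U if NBH[x] <= X)


def upper(X):
    return frozenset(x for x in U if NBH[x] & X)


def closure(X):
    """X -> (X^R)_R, the closure operator of Corollary 7(b)."""
    return lower(upper(X))


def interior(X):
    """X -> (X_R)^R, the interior operator of Corollary 7(b)."""
    return upper(lower(X))


SUBSETS = [frozenset(c) for r in range(len(U) + 1) for c in combinations(sorted(U), r)]

for X in SUBSETS:
    assert upper(closure(X)) == upper(X)         # Corollary 7(a), first identity
    assert lower(interior(X)) == lower(X)        # Corollary 7(a), second identity
    assert interior(X) <= X <= closure(X)        # contractive / extensive
    assert closure(closure(X)) == closure(X)     # idempotent
    assert interior(interior(X)) == interior(X)
    for Y in SUBSETS:
        if X <= Y:                               # order-preserving
            assert closure(X) <= closure(Y)
            assert interior(X) <= interior(Y)

# Unlike the equivalence case, applying the upper approximation twice
# can still enlarge the set.
X = frozenset({1})
print(sorted(upper(X)), sorted(upper(upper(X))))
```

The final line shows the contrast with equivalences noted above: for this tolerance the upper approximation of {1} is {1, 2}, and approximating again gives {1, 2, 3}.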

3 Definable Sets 

In this section we consider R-definable sets, where R is a tolerance.

Definition 8. Let $R \in Tol(U)$. A set $X \subseteq U$ is R-definable if $X_R = X^R$.

We denote by Def(R) the set of all R-definable sets. It is obvious that a set X is
R-definable if and only if its R-boundary $B_R(X)$ is empty.

In Example 1, $Def(R) = \{\emptyset, \{1, 2, 3\}, \{4, 5, 6, 7\}, U\}$. For instance, the
R-definable set {1, 2, 3} consists of the patients who probably have certain influenza.

By Lemma 9(a), to show that a set is definable requires only half as much work as
the definition suggests. Lemma 9(b) presents another interesting property of definable
sets (cf. Corollary 7(d)).

Lemma 9. Let $R \in Tol(U)$ and $X, Y \subseteq U$.

(a) $X \in Def(R)$ iff $X^R = X$ iff $X_R = X$;

(b) if $X \in Def(R)$, then $(X \cup Y)_R = X_R \cup Y_R$ and
$(X \cap Y)^R = X^R \cap Y^R$.

Next we present some notions which we shall need. The set of all equivalences on U
is denoted by Eq(U). We say that $X \subseteq U$ is saturated by an equivalence
$E \in Eq(U)$, if X is the union of some equivalence classes of E or $X = \emptyset$. The
set of all sets saturated by E is denoted by Sat(E). A family
$\mathcal{F} \subseteq \wp(U)$ is called a complete field of sets if
$\emptyset, U \in \mathcal{F}$, $X^c \in \mathcal{F}$ for all $X \in \mathcal{F}$, and
$\bigcup \mathcal{H}, \bigcap \mathcal{H} \in \mathcal{F}$ for all
$\mathcal{H} \subseteq \mathcal{F}$. For $R \in Tol(U)$, we denote by $R^e$ the smallest
equivalence on U which includes R. It is well-known and obvious that
$R^e = \bigcap \{E \in Eq(U) \mid R \subseteq E\}$.

Proposition 10. If $R \in Tol(U)$, then

(a) $Def(R) = Sat(R^e)$;

(b) Def(R) is a complete field of sets.

Note also that $X^{R^e}$ is the least R-definable set including X, and that $X_{R^e}$ is
the greatest R-definable set included in X.

Let $E \in Eq(U)$. By Proposition 10(a), the E-definable sets are the unions of some
(or none) E-classes, and this actually is Pawlak's original definition of E-definable
sets [15]. We also mention that the sets $X^E$ and $X_E$ are E-definable for all
$X \subseteq U$, but the approximations induced by tolerances are not necessarily
definable. For instance in Example 1, $\{2\}^R = \{1, 2\}$ and $\{1, 2\}_R = \{2\}$.
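Proposition 10(a) suggests a cheap way to list the definable sets: compute the classes of $R^e$ (the connected components of R viewed as a graph) and take all their unions. The following Python sketch compares this with a brute-force search. The edge list for the tolerance of Example 1 is an assumption, chosen to agree with the values quoted in the text, and the helper names are illustrative.

```python
from itertools import combinations

U = frozenset(range(1, 8))
EDGES = [(1, 2), (1, 3), (4, 5), (5, 6), (5, 7)]  # assumed edges of R

NBH = {a: {a} for a in U}
for a, b in EDGES:
    NBH[a].add(b)
    NBH[b].add(a)


def lower(X):
    return frozenset(x for x in U if NBH[x] <= X)


def upper(X):
    return frozenset(x for x in U if NBH[x] & X)


def classes_of_smallest_equivalence():
    """Classes of R^e: connected components found by depth-first search."""
    seen, classes = set(), []
    for start in sorted(U):
        if start in seen:
            continue
        block, stack = set(), [start]
        while stack:
            a = stack.pop()
            if a not in block:
                block.add(a)
                stack.extend(NBH[a] - block)
        seen |= block
        classes.append(frozenset(block))
    return classes


SUBSETS = [frozenset(c) for r in range(len(U) + 1) for c in combinations(sorted(U), r)]

# Brute force: sets with equal lower and upper approximation.
definable = {X for X in SUBSETS if lower(X) == upper(X)}

# Proposition 10(a): unions of some (or none) of the R^e-classes.
classes = classes_of_smallest_equivalence()
saturated = {
    frozenset().union(*combo)
    for r in range(len(classes) + 1)
    for combo in combinations(classes, r)
}

print(sorted(sorted(X) for X in definable))
```

Both routes give the four sets listed for Example 1, while the approximations themselves, such as the upper approximation of {2}, need not be among them.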

4 Rough Equalities 

First we define different types of equalities based on approximations. For equivalence
relations the corresponding notions were defined in [10, 11], where M. Novotný and
Z. Pawlak also characterized all three types of rough equalities on finite sets.
M. Steinby [20] has generalized these characterizations by omitting the assumption of
finiteness, and in [6] J. Järvinen characterized rough equalities defined by tolerances.

Definition 11. Let $R \in Tol(U)$. We define in $\wp(U)$ the lower R-equality
$\approx_R$, the upper R-equality $\approx^R$, and the R-equality $\equiv_R$ by the
following conditions:

$X \approx_R Y \iff X_R = Y_R$;

$X \approx^R Y \iff X^R = Y^R$;

$X \equiv_R Y \iff X_R = Y_R$ and $X^R = Y^R$.

Because the pair $(^R, _R)$ is a dual Galois connection on $(\wp(U), \subseteq)$, the
relation $\approx^R$ is a congruence on $(\wp(U), \cup)$ such that the greatest element
in the $\approx^R$-class of any $X \subseteq U$ is $(X^R)_R$ by Proposition 5.
Analogously, the relation $\approx_R$ is a congruence on $(\wp(U), \cap)$ such that the
least element in the $\approx_R$-class of any $X \subseteq U$ is $(X_R)^R$. The relation
$\equiv_R$ is an equivalence on $\wp(U)$, but it is not usually a congruence on
$(\wp(U), \cup)$ or on $(\wp(U), \cap)$, and this holds even if R is an equivalence.

Let us denote by $\mathcal{G}(R)$ the set of the greatest elements of the
$\approx^R$-classes, and let $\mathcal{L}(R)$ denote the set of the least elements of the
$\approx_R$-classes. Now the next proposition holds.

Proposition 12. For any $R \in Tol(U)$,

(a) $\mathcal{G}(R) = \{X_R \mid X \subseteq U\}$ and
$\mathcal{L}(R) = \{X^R \mid X \subseteq U\}$;

(b) the complete lattices $(\mathcal{G}(R), \subseteq)$ and $(\mathcal{L}(R), \subseteq)$
are order-isomorphic.

For an equivalence E (see e.g. [20]),
$\mathcal{G}(E) = \mathcal{L}(E) = Sat(E) = Def(E)$. Furthermore, if $E_1$ and $E_2$ are
different equivalences, then obviously $\approx_{E_1}$ differs from $\approx_{E_2}$,
$\approx^{E_1}$ differs from $\approx^{E_2}$, and $\equiv_{E_1}$ differs from
$\equiv_{E_2}$. We end this section by showing that different tolerances may define the
same lower and upper equality.

Example 13. Let $U = \{a, b, c, d\}$ and let R, S, and T be tolerances on U such that

a/R = c/S = b/T = {a, b, c};  b/R = d/S = a/T = {a, b, d};

c/R = a/S = d/T = {a, c, d};  d/R = b/S = c/T = {b, c, d}.

Now the relations $\approx_R$, $\approx_S$, and $\approx_T$ are equal, and they have the
six congruence classes $\{\emptyset, \{a\}, ..., \{c, d\}\}$, $\{\{a, b, c\}\}$,
$\{\{a, b, d\}\}$, $\{\{a, c, d\}\}$, $\{\{b, c, d\}\}$, and $\{U\}$. Similarly, the
relations $\approx^R$, $\approx^S$, and $\approx^T$ are identical and they have six
congruence classes $\{\emptyset\}$, $\{\{a\}\}$, $\{\{b\}\}$, $\{\{c\}\}$, $\{\{d\}\}$, and
$\{\{a, b\}, ..., U\}$. It is obvious that also $\equiv_R$, $\equiv_S$, and $\equiv_T$ are
the same. They have 11 equivalence classes.
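The class counts in Example 13 are easy to confirm by enumeration. In the Python sketch below the three tolerances are entered through their neighborhoods exactly as listed in the example; the function names are illustrative only.

```python
from itertools import combinations

U = "abcd"

# Neighborhoods x/R, x/S, x/T as listed in Example 13.
R = {"a": "abc", "b": "abd", "c": "acd", "d": "bcd"}
S = {"c": "abc", "d": "abd", "a": "acd", "b": "bcd"}
T = {"b": "abc", "a": "abd", "d": "acd", "c": "bcd"}


def lower(nbh, X):
    return frozenset(x for x in U if set(nbh[x]) <= X)


def upper(nbh, X):
    return frozenset(x for x in U if set(nbh[x]) & X)


SUBSETS = [frozenset(c) for r in range(len(U) + 1) for c in combinations(U, r)]


def classes(key):
    """Partition the power set of U into blocks of sets with equal key(X)."""
    blocks = {}
    for X in SUBSETS:
        blocks.setdefault(key(X), set()).add(X)
    return {frozenset(block) for block in blocks.values()}


def equalities(nbh):
    """Classes of the lower R-equality, the upper R-equality and the R-equality."""
    low = classes(lambda X: lower(nbh, X))
    upp = classes(lambda X: upper(nbh, X))
    both = classes(lambda X: (lower(nbh, X), upper(nbh, X)))
    return low, upp, both


results = [equalities(nbh) for nbh in (R, S, T)]
print([len(partition) for partition in results[0]])
```

All three tolerances produce identical partitions, with 6, 6 and 11 classes respectively, although R, S and T are pairwise different relations.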

5 Rough Sets 

In the classical setting a rough set may be defined as an equivalence class of sets which
look the same in view of the knowledge restricted by the given indiscernibility
relation, i.e., as a class of sets having the same lower approximation and the same upper
approximation. Rough sets defined by a tolerance can be defined the same way.

Definition 14. Let $R \in Tol(U)$. The equivalence classes of the R-equality $\equiv_R$
are called R-rough sets.

For studying the structure of the system of R-rough sets it is convenient to adopt a
formulation used (for equivalence relations) by T. B. Iwiński [4]. It is based on the
fact that R-rough sets can be equivalently viewed as pairs $(X_R, X^R)$, where
$X \subseteq U$, since each $\equiv_R$-class $\mathcal{C}$ is uniquely determined by the
pair $(X_R, X^R)$, where X is any member of $\mathcal{C}$.

Let $R \in Tol(U)$. For any $X \subseteq U$, the pair $R(X) = (X_R, X^R)$ is called the
R-approximation of X. The set of all R-approximations of the subsets of U is
$A(R) = \{R(X) \mid X \subseteq U\}$.

Let R be the tolerance on $U = \{1, ..., 7\}$ considered in Example 1. For example,
the sets {1, 5} and {2, 3, 4, 6, 7} belong to the same R-rough set. Note that these sets
are the complements of each other! This rough set can be represented by the pair
$(\emptyset, U)$. The other members of it are {2, 3, 5} and {1, 4, 6, 7}.
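The members of an R-rough set can be listed by grouping subsets according to their R-approximation. The Python sketch below does this for the rough set containing {1, 5}. The edge list for the tolerance of Example 1 is an assumption chosen to agree with the values stated in the text, and the function names are illustrative.

```python
from itertools import combinations

U = frozenset(range(1, 8))
EDGES = [(1, 2), (1, 3), (4, 5), (5, 6), (5, 7)]  # assumed edges of R

NBH = {a: {a} for a in U}
for a, b in EDGES:
    NBH[a].add(b)
    NBH[b].add(a)


def approximation(X):
    """The R-approximation R(X) = (lower, upper) of X."""
    low = frozenset(x for x in U if NBH[x] <= X)
    upp = frozenset(x for x in U if NBH[x] & X)
    return low, upp


SUBSETS = [frozenset(c) for r in range(len(U) + 1) for c in combinations(sorted(U), r)]

target = approximation(frozenset({1, 5}))
members = sorted(sorted(X) for X in SUBSETS if approximation(X) == target)
print(target == (frozenset(), U), members)
```

With these edges the rough set represented by $(\emptyset, U)$ has exactly the four members named above, two complementary pairs.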

There is a canonical order-relation <onp({7)xp(C/) defined by 



(Xi,X 2 ) < {Yi,Y 2 ) ^ XiCYi andXs C Fs- 

Because A{R) C p{U) x p(U) for all R G Tol(C/), the set A{R) may be ordered by <. 
M. Gehrke and E. Walker have shown in [3] that if £■ is an equivalence, then (A{E), <) 
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is a complete Stone lattice isomoiphic to (2^ x 3'^, <), where / = {a/E : \a!E\ = 1} 
and J = {a/E : \a/E\ > 1}. Here n denotes the set {0, 1, , , . , n - 1} ordered by 
0 < 1 - • < n - 1. 

The rough set systems defined by tolerances differ essentially from the ones defined 
by equivalences. Our final example shows that (A{R)^ <), where R £ Tol(i[7), is not 
necessarily even a semilattice when [/ > 5. 

Example 15. Let $U = \{1, 2, 3, 4, 5\}$ and let $R$ be a tolerance on $U$ such that

$1/R = \{1, 2\}$, $2/R = \{1, 2, 3\}$, $3/R = \{2, 3, 4\}$, $4/R = \{3, 4, 5\}$, $5/R = \{4, 5\}$.

The Hasse diagram of $(A(R), \le)$ is given in Figure 1. Note that $(A(R), \le)$ is not a
join-semilattice because, for instance, the elements $(1, 123)$ and $(\emptyset, 1234)$ do not have a
supremum. Similarly, $(A(R), \le)$ is not a meet-semilattice since the elements $(12, 1234)$
and $(1, U)$ do not have an infimum.
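
The claim of Example 15 can be checked by brute force. The following is only a sketch under stated assumptions: it takes the lower approximation of $X$ to be the set of points whose $R$-neighbourhood lies inside $X$ and the upper approximation to be the set of points whose $R$-neighbourhood meets $X$, and the function names (`lower`, `upper`, `approximations`, `bound`) are illustrative rather than taken from the paper.

```python
from itertools import combinations

U = frozenset(range(1, 6))
# Tolerance neighbourhoods x/R as listed in Example 15.
N = {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3, 4}, 4: {3, 4, 5}, 5: {4, 5}}

def lower(X):
    # Points whose whole neighbourhood lies in X (assumed definition).
    return frozenset(x for x in U if N[x] <= X)

def upper(X):
    # Points whose neighbourhood meets X (assumed definition).
    return frozenset(x for x in U if N[x] & X)

def approximations():
    # A(R): all pairs (lower(X), upper(X)) for X ranging over subsets of U.
    pairs = set()
    for k in range(len(U) + 1):
        for combo in combinations(sorted(U), k):
            X = frozenset(combo)
            pairs.add((lower(X), upper(X)))
    return pairs

def leq(p, q):
    # Componentwise inclusion order on pairs.
    return p[0] <= q[0] and p[1] <= q[1]

def bound(p, q, A, kind):
    # Least upper bound (kind="sup") or greatest lower bound (kind="inf")
    # of p and q inside A, or None when it does not exist.
    if kind == "sup":
        cands = [r for r in A if leq(p, r) and leq(q, r)]
        best = [r for r in cands if all(leq(r, s) for s in cands)]
    else:
        cands = [r for r in A if leq(r, p) and leq(r, q)]
        best = [r for r in cands if all(leq(s, r) for s in cands)]
    return best[0] if best else None

A = approximations()
a = (frozenset({1}), frozenset({1, 2, 3}))
b = (frozenset(), frozenset({1, 2, 3, 4}))
c = (frozenset({1, 2}), frozenset({1, 2, 3, 4}))
d = (frozenset({1}), U)
print(bound(a, b, A, "sup"), bound(c, d, A, "inf"))
```

Under these assumptions the pair $(1, 1234)$, which would be the only candidate for both the supremum and the infimum, is not an $R$-approximation of any subset, so both calls return `None`.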




Conclusions. We have studied approximations and rough sets based on tolerances. The
motivation of this work was that there are natural indiscernibility relations which are
not transitive. The main contribution of this paper is the observation that the
properties of approximations and rough sets defined by tolerances differ from the classical
ones defined by equivalences. The biggest differences are that the tolerance
approximations are not necessarily definable, that different tolerances may induce the same
rough equalities, and that the ordered sets of tolerance rough sets are not necessarily
even semilattices. At the end of each section these differences were briefly discussed.
Future work is to study how our results can be applied, for example, to Pawlak's
information systems and to contexts in the sense of Wille [2].
Abstract. We investigate a Rough Set approach to treating imperfect
data in Inductive Logic Programming. Due to the generality of the
language, we base our approach on neighborhood systems. A first-order
decision system is introduced and a greedy algorithm for finding a set 
of rules (or clauses) is given. Furthermore, we describe two problems for 
which it can be used. 



1 Introduction 

Inductive Logic Programming (ILP) has been defined as the intersection between 
(inductive) Machine Learning and Logic Programming [9]. The field is concerned 
with finding hypotheses or sets of rules from examples, and first-order logic 
is used as representation language. Compared to traditional Machine Learning 
(ML) and Rough Sets (RS), ILP has some advantages. Firstly, a more expressive 
representation is used. This means that more complex concepts (e.g., that two 
attributes are equal) may be found. Secondly, ILP allows the use of background 
knowledge in a natural and compact way since such knowledge can be expressed 
as predicates. Thirdly, problems that are not handled well by traditional ML 
such as the multiple-instance problem [1] may be represented and solved in this 
framework. 

There have been some attempts to combine ILP and RS already. Siromoney 
and Inoue [12] introduce and formalize a method where two sets of clauses are 
found - one for the positive examples (or the positive region), and one for the 
negative examples (or the negative region). The sets are found by an ordinary 
ILP-system such as Progol [8]. Their approach is based on ILP, but not confined
to ILP. It could easily be adapted to propositional problems as well, provided
that there are only two decision classes. Stepaniuk [14] uses the same
transformational scheme as in LINUS [3] and introduces a method for reducing the
background knowledge by removing irrelevant clauses. The method assumes that
the background knowledge is extensional (i.e., it consists of only ground atoms).
Liu and Zhong [6] notice that imperfect data (such as uncertain, inconsistent,
missing and noisy data) is not handled well in ILP. They suggest several
problem settings for RS in ILP, but do not offer any solutions.

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 190-198, 2001. 
In this paper, we describe a framework based on RS for handling imperfect
data in ILP, and give a greedy algorithm for finding a set of clauses. We assume 
that the reader is familiar with RS and Logic Programming. RS notions and 
notation follow Komorowski et al. [2] which is also a good introduction to RS. 

2 Inductive Logic Programming 

2.1 The Normal Setting 

In ILP, the goal is to find a hypothesis $H$ from background knowledge $B$ and
examples $E$, where $B$, $E$, and $H$ usually are sets of definite clauses. The set $E$
can be divided further into two sets, the positive examples $E^+$ and the negative
examples $E^-$. In the normal setting¹ the following conditions must hold:
$\forall e \in E^- : B \not\models e$, $\forall e \in E^- : B \cup H \not\models e$,
$\exists e \in E^+ : B \not\models e$, and $\forall e \in E^+ : B \cup H \models e$.
The normal setting is the main setting of ILP, but the definition given here 
is rather general. Additional restrictions are usually set. For example, only the 
definition of a single predicate is found, and this predicate is called the target 
predicate. Furthermore, the clauses are often assumed function-free since clauses 
with functions can be turned into function-free ones by means of flattening [11].
In this paper, we will use these restrictions. In addition, we will assume that 
is a set of normal non-recursive clauses. The clauses in E will be represented 
as tuples $(a, b)$ where $a$ and $b$ are ground, $b$ corresponds to the body and $a$ to
the head. An example clause can be put in this form by first skolemizing it and 
then breaking the head and the body apart. 



2.2 Declarative Bias 

Since the language of clauses is infinite, all ILP systems restrict the language in 
some way. Most ILP systems employ a declarative bias so the restrictions can be 
set by the user. Mode declarations originate from Prolog, and are declarations 
of the kind of the terms that are allowed in a predicate. They have been used in 
several ILP systems including Progol [8]. We adopt a very similar scheme here. 
By position, we mean the place where a term occurs in an atom. So, term $t_i$ has
position $i$ ($1 \le i \le n$) in atom $p(t_1, \ldots, t_n)$.

Definition 1. A mode declaration is a declaration of a predicate in either the
head, $modeh(pred(t_1, \ldots, t_n))$, or the body, $modeb(recall, pred(t_1, \ldots, t_n))$. Each
$t_i$ defines the type and the kind of term allowed at the corresponding position in
the predicate. A $t_i$ can have one of the following forms:

+t - the term must be an input variable of type t.

-t - the term must be an output variable of type t.

#t - the term must be a ground constant of type t.

¹ The normal setting is also called the strong setting, explanatory induction, and
learning from entailment.

The recall (modeb only) restricts the number of times that a predicate with the
same combination of input variables may appear in a clause.

For example, $modeb(2, owns(+person, -stuff))$ declares the first term of owns
to be an input variable of type person and the second term to be an output
variable of type stuff. For each combination of input variables, the predicate can
be repeated twice. So $owns(X, B), owns(X, C)$ is permitted, but
$owns(X, B), owns(X, C), owns(X, D)$ is not.
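
The recall restriction in the owns example can be sketched as a small check. This is a minimal illustration, not the representation used by Progol or by the authors: the class `ModeB`, the tuple encoding of atoms, and the function `respects_recall` are all hypothetical names introduced here.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class ModeB:
    recall: int   # maximum repetitions per combination of input variables
    pred: str     # predicate name
    args: tuple   # one marker per position, e.g. ("+person", "-stuff")

def respects_recall(mode, atoms):
    # atoms: iterable of (predicate name, tuple of terms).
    # Count how often the predicate occurs with each combination of
    # terms sitting at the input ("+") positions.
    counts = Counter()
    for pred, terms in atoms:
        if pred != mode.pred:
            continue
        inputs = tuple(t for t, m in zip(terms, mode.args) if m.startswith("+"))
        counts[inputs] += 1
    return all(n <= mode.recall for n in counts.values())

owns = ModeB(2, "owns", ("+person", "-stuff"))
two = [("owns", ("X", "B")), ("owns", ("X", "C"))]
three = two + [("owns", ("X", "D"))]
print(respects_recall(owns, two), respects_recall(owns, three))
```

With a recall of 2 the two-literal body is accepted and the three-literal body is rejected, matching the example in the text.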

Let $m(l)$ be a function associating a literal $l$ with the mode declaration
defining it. Given atom $a = p(s_1, \ldots, s_n)$, $In(a)$ denotes the set of the input variables
in $a$ and their types, i.e., $In(a) = \{(x, t) \mid m(a) = mode(b|h)(r, p(t_1, \ldots, t_n))$
and $s_i = x$ and $t_i = +t$, $1 \le i \le n\}$. $Out(a)$ and $Cons(a)$ are defined similarly
for output variables and constants. For a set of atoms $B$, the sets of input
variables, output variables, and constants are the unions of the corresponding sets
for each atom in $B$, e.g., $In(B) = \bigcup_{a \in B} In(a)$.

3 Towards Rough Inductive Logic Programming 

Rough Set Theory is based on propositional logic, and makes assumptions which
are not true in first-order logic. When going from propositional logic to first-order
logic, these assumptions have to be dealt with.

Firstly, the indiscernibility relation assumes that all attributes can be
considered separately. This is not true in ILP. Attributes correspond to atoms, and
atoms cannot generally be considered separately. For example, if $P$ and $Q$ are
true, then so is $P \wedge Q$, but the truth of $\exists(P(x, y))$ and $\exists(Q(y, x))$ does not imply
$\exists(P(x, y) \wedge Q(y, x))$. Thus we have to consider the whole body of a clause. This
corresponds to comparing information vectors in RS.

Secondly, each object in the universe $U$ (i.e., each example) satisfies one
information vector or rule condition. This is not generally true in ILP. For each
atom, we can create a positive and a negative literal, but they are not always
complementary. So, an example may satisfy several bodies. For example, if $B =
\{Q(a, b), Q(a, c), R(b)\}$ then both $c_1 = P(X) \leftarrow Q(X, Y) \wedge R(Y)$ and $c_2 =
P(X) \leftarrow Q(X, Y) \wedge \neg R(Y)$ imply example $P(a)$. However, the set $\{c_1, c_2\}$ is
complete in the sense that any example which follows from $P(X) \leftarrow Q(X, Y)$
also follows from at least one of them. Thus we need a more general relation than
the indiscernibility relation since the resulting clauses are not always complementary.

4 A First-Order Decision System 

We are now ready to define a first-order equivalent of a decision system. 

Definition 2. A first-order decision system $I$ is a tuple $(B, E, M, M_h)$, where $B$ is a
set of definite clauses defining the background knowledge, $E$ is a set of ground
examples $e = (a, b)$, where $b$ corresponds to the body and $a$ to the head, $M$ is
a set of modeb declarations, and $M_h$ is a modeh declaration for the target predicate
which occurs as the head in all of the examples in $E$. The target predicate is
allowed in neither the bodies of the examples nor the background knowledge.
We will assume that $C$ is a set containing the constants in $B$ and $E$, and $V$
is a countable set of variables. $h$ denotes an atom of the target predicate defined
by $M_h$ and is instantiated from the variables in $V$.

From $M$, $C$ and $V$, we can define an atom set $A$, but not every possible atom
should be allowed in $A$. For example, if mode(p(+t)) is a mode declaration,
but $p(X) \in A$ is not connected to the head (i.e., $X$ does not occur as an input
variable in $h$ or an output variable in $A$), $p(X)$ will violate the mode declaration.
The reason is that an input variable has to be instantiated. We define an atom
set inductively.

Definition 3. (Atom set) $A$ is an atom set if $A = \emptyset$ or $A = B \cup C$, where $B$ is an
atom set and $C \subseteq \rho(B)$, and $\rho(B)$ is defined as:

1. $p(s_1, \ldots, s_n) \in \rho(B)$ if $modeb(r, p(t_1, \ldots, t_n))$ is in $M$ and for $1 \le i \le n$:

(a) $s_i = x$ if $t_i = +t$ and $(x, t) \in Out(B) \cup In(h)$,

(b) $s_i = x$ if $t_i = -t$ and $x$ is a new variable which does not occur in $B$,

(c) $s_i = c$ if $t_i = \#t$ and $c \in C$ is of type $t$,

and $p$ occurs with the same combination of input variables at most $r - 1$
times in $B$.

2. $(s_1 = s_2) \in \rho(B)$ if $modeb(r, t_1 = t_2)$ is in $M$ and for $i = 1, 2$:

(a) $s_i = x$ if $t_i = +t$ and $(x, t) \in Out(B) \cup In(h)$,

(b) $s_i = x$ if $t_i = -t$ and $(x, t) \in Out(h)$,

(c) $s_i = x$ if $t_i = \#t$ and $x$ is a new variable which does not occur in $B$.

$\rho(B)$ defines a mapping from atom sets to atoms and is similar to a refinement
operator in ILP. The predicate =/2 is treated specially for several reasons.
Firstly, we want the language of attribute-value pairs (used in ML and RS) to be
a sub-language of ours. Such pairs may be declared by mode declarations like
$modeb(1, +person = \#person)$, but the constants can only be assigned when
bodies are created from the atom set. The reason is that we need to distinguish
between different constant values rather than the truth values of the predicate.
So, a position corresponding to a #t definition contains a variable in the atom
set. Secondly, the output variables in an atom are new variables. It must be
possible to unify or test the equality of such variables. In particular, it must be
possible to unify them with the output variables in the head. This is done by
the =/2 predicate which is the only predicate that is allowed to contain output
variables from the head.

The atoms in the atom set cannot be considered separately, but must be 
considered together with the rest of the atoms in the set. So, to evaluate how 
well each atom distinguishes between the examples, we need to create bodies 
from the atom set. 

Definition 4. (Body) Let $A$ be an atom set. $c$ is a body created from $A$ if it
is a conjunction of literals such that for each $a$ in $A$, $l_a$ is in $c$, where $l_a = a\sigma$
if $m(a) = modeb(r, t_1 = t_2)$ and either $t_1$ or $t_2$ is equal to $\#t$ and $\sigma$ is a
substitution assigning $x$ a constant of type $t$, where $(x, t) \in Cons(a)$. Otherwise
$l_a$ is either $a$ or $\neg a$.
When we use negation-as-failure, the new variables introduced by a negative
literal are universally quantified. A clause such as $p(X) \leftarrow q(X, Y) \wedge \neg r(Y, Z)$ is
an abbreviation of $\forall X(p(X) \leftarrow \exists Y \forall Z(q(X, Y) \wedge \neg r(Y, Z)))$. However, variable $Z$
will not be instantiated by SLDNF-resolution if $\neg r(Y, Z)$ succeeds. This means
that the output variables of a negative literal cannot be used as input variables
in other literals. Thus each literal in the body must be connected to the head
through a chain of positive literals. Given a body, we call the maximal subset
satisfying this property an admissible body.

Definition 5. (Admissible body) Let $c$ be a body created from an atom set. Then
$d$ is an admissible body if $d$ is the maximal subset of $c$ such that for each literal
$l$ in $d$, $In(l) \subseteq (In(h) \cup Out(pos(d)))$ where $pos(d)$ contains the positive literals
in $d$.

Definition 6. (Body set) Let A be an atom set. A body set BS[A) is the set of 
all admissible bodies created from A. 
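
One way to read Definition 5 operationally is as forward chaining from the input variables of the head. The sketch below is an interpretation rather than the authors' procedure: it assumes that "connected to the head through a chain of positive literals" means a literal is kept once all of its input variables are already bound, and that only positive literals bind their output variables. The tuple encoding `(positive, name, inputs, outputs)` and the function name `admissible_body` are hypothetical.

```python
def admissible_body(head_inputs, body):
    # body: list of literals (positive, name, inputs, outputs).
    known = set(head_inputs)      # variables bound so far
    kept = []
    remaining = list(body)
    changed = True
    while changed:
        changed = False
        for lit in list(remaining):
            positive, _name, ins, outs = lit
            if set(ins) <= known:
                kept.append(lit)
                remaining.remove(lit)
                if positive:
                    # Only positive literals instantiate their outputs.
                    known |= set(outs)
                changed = True
    return kept

# Body father(X, Z), not parent(Z, W), W = Y with head inputs X and Y
# (the head's mode is an assumption made for this illustration).
body = [
    (True, "father", ("X",), ("Z",)),
    (False, "parent", ("Z",), ("W",)),
    (True, "=", ("W", "Y"), ()),
]
result = admissible_body({"X", "Y"}, body)
print([name for _pos, name, _ins, _outs in result])
```

Here the equality literal is dropped because W is only produced by a negative literal, leaving a two-literal body of the same shape as the body father(X, Z), not parent(Z, W) in Example 1.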



5 A Neighborhood System for ILP 

The coverage Cov[r) of a rule r is the set of objects which satisfies the condition 
of the rule. In a propositional decision system, each object satisfies only the 
condition of one rule and two objects are indiscernible if they satisfy the same 
condition. Thus the coverage of a rule is equal to the elementary set of objects 
satisfying its condition. As mentioned in Section 3, an object may satisfy several
clause bodies when only a single atom is negated. Therefore we have no such 
relationship between elementary sets and coverages in our situation. Moreover, it 
is not possible to define an equivalence relation since the coverages may overlap. 
However, we may define a neighborhood system [4, 5] which is a generalization 
of RS. 

A neighborhood system (NS) is a mapping from a set $V$ to collections of
subsets of $U$: $p \in V \mapsto \{N_p\} \subseteq 2^U$, where the $N_p$'s are subsets of $U$ which
are associated with $p$. In our problem both $U$ and $V$ correspond to the set of
examples $E$. $N_p$ corresponds to the coverage of some body which is satisfied by
example $p$. We can define a neighborhood system as follows.

Definition 7. (Coverage) Let $c$ be a body, and $h$ the target atom. The coverage
of $c$ is $Cov(c) = \{(a, b) \in E \mid M(B \cup b) \models \exists(c\theta)$ and $a = h\theta\}$.

Definition 8. A neighborhood system $NS_{BS(A)}(e)$ is a mapping from examples
to coverages, $NS_{BS(A)}(e) = \{Cov(c) \mid e \in Cov(c)$ and $c \in BS(A)\}$.

The lower and upper approximations of a neighborhood system are:

$\underline{A}X = \{e \in E \mid \exists Cov(c) \in NS(e) : Cov(c) \subseteq X\}$

$\overline{A}X = \{e \in E \mid \forall Cov(c) \in NS(e) : Cov(c) \cap X \neq \emptyset\}$
[Figure: coverages drawn over the examples gf(adam,cathy), gf(bill,donald), gf(adam,bill), gf(anne,cathy), and gf(anne,beth)]

Fig. 1. Coverages of the body set in Example 1.



Example 1. We would like to find a definition of grandfather (denoted by gf/2)
from the predicates father/2 and parent/2 given the following background
knowledge and examples:

$B = \{\, parent(X, Y) \leftarrow father(X, Y),\ parent(X, Y) \leftarrow mother(X, Y),$
$father(adam, bill),\ father(adam, beth),\ father(bill, cathy),$
$mother(anne, bill),\ mother(anne, beth),\ mother(cathy, donald),\ mother(cathy, sue) \,\}$

$E^+ = \{\, gf(adam, cathy)\ (e_1),\ gf(bill, donald)\ (e_2) \,\}$

$E^- = \{\, gf(anne, cathy)\ (e_3),\ gf(anne, beth)\ (e_4),\ gf(adam, bill)\ (e_5) \,\}$


Assume that the target atom is $h = gf(X, Y)$, and that we have found the
following atom set $A = \{father(X, Z), parent(Z, W), W = Y\}$. Then the body
set $BS(A)$ consists of:

$father(X, Z) \wedge parent(Z, W) \wedge W = Y$  $(b_1)$

$father(X, Z) \wedge parent(Z, W) \wedge W \neq Y$  $(b_2)$

$father(X, Z) \wedge \neg parent(Z, W)$  $(b_3)$

$\neg father(X, Z)$  $(b_4)$


This body set has the coverages depicted in Fig. 1. The neighborhood system
constructed from them is: $NS(e_1) = \{\{e_1, e_2\}, \{e_1, e_3\}\}$, $NS(e_2) = \{\{e_1, e_2\},
\{e_2, e_3\}\}$, $NS(e_3) = \{\{e_1, e_3\}, \{e_2, e_3\}\}$, $NS(e_4) = \{\{e_4, e_5\}\}$, and $NS(e_5) =
\{\{e_4, e_5\}\}$. The lower and upper approximations for $E^+$ and $E^-$ are: $\underline{A}E^+ =
\{e_1, e_2\}$, $\overline{A}E^+ = \{e_1, e_2, e_3\}$, $\underline{A}E^- = \{e_4, e_5\}$, and $\overline{A}E^- = \{e_3, e_4, e_5\}$.
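
The approximations stated for this example can be recomputed from the neighborhood system alone. The sketch below assumes the reading used in neighborhood-system theory, in which an example is in the lower approximation if some coverage in its neighborhood lies inside X, and in the upper approximation if every coverage in its neighborhood meets X. The dictionary `NS` copies the neighborhoods listed in the text, and the function names are illustrative.

```python
# Neighborhood system of Example 1, as listed in the text.
NS = {
    "e1": [{"e1", "e2"}, {"e1", "e3"}],
    "e2": [{"e1", "e2"}, {"e2", "e3"}],
    "e3": [{"e1", "e3"}, {"e2", "e3"}],
    "e4": [{"e4", "e5"}],
    "e5": [{"e4", "e5"}],
}

def lower(X, ns):
    # e is in the lower approximation if SOME coverage around e lies in X.
    return {e for e, covs in ns.items() if any(c <= X for c in covs)}

def upper(X, ns):
    # e is in the upper approximation if EVERY coverage around e meets X
    # (assumed reading of the second formula).
    return {e for e, covs in ns.items() if all(c & X for c in covs)}

E_pos = {"e1", "e2"}
E_neg = {"e3", "e4", "e5"}
print(sorted(lower(E_pos, NS)), sorted(upper(E_pos, NS)))
print(sorted(lower(E_neg, NS)), sorted(upper(E_neg, NS)))
```

With the existential reading for the lower and the universal reading for the upper approximation, the four sets agree with the values given in the example.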



6 Two Problems of Finding Clauses 

Given a first-order decision system there are basically two problems that can be
solved, depending on whether we want to find a definition of a predicate or a function.
6.1 The Predicate Problem

Usually, the ILP problem has two decision classes - the positive and the negative
examples. Hence, we may define a function $d(e)$ which denotes the decision class
as $d(e) = +$ if $e \in E^+$ and $d(e) = -$ if $e \in E^-$. Since we have only two classes,
it is not necessary to find clauses for both of them. We can create clauses that 
predict one of the classes and assume that an instance is in the other class if it 
is not predicted by the clauses. So, there are two ways to generate clauses: 

1. Find one set of clauses which explicitly predicts both positive and negative 
instances of the predicate. 

2. Find two sets of clauses which correspond to the upper and lower approxi- 
mation of the positive examples. The negative instances can be predicted as 
not following from these sets, i.e., by means of the closed world assumption 
(CWA). 

6.2 The Function Problem 

In the problem above, there are just two decision classes, but a decision system
$A = (U, A \cup \{d\})$ usually has more than two classes. Moreover, each information
vector has only one associated decision class, if the system is consistent. This
corresponds to learning a functional predicate where the set $A$ is a set of input
variables and $d$ is an output variable. A predicate is functional if each binding
of the input variables has at most one binding of the output variables. The
problem of finding such a functional predicate can be formulated as follows:
Find a hypothesis $H$ such that $\forall (a, b) \in E : B \cup b \cup H \models a$ and $\forall (a, b) \in E\ \forall \sigma :
B \cup b \cup H \models h\theta_{in}\sigma$ implies $h\theta_{in}\sigma = a$. Here, $h$ is an instance of the target
predicate with new variables at each position such that $h\theta = a$, and $\theta_{in}$ is a
subset of $\theta$ containing the substitution for the input variables. $\sigma$ denotes some
substitution binding all variables in $h$ that correspond to constants and output
variables.

This problem can be solved in our framework. The decision class of an
example $e = (a, b)$ is just the substitution corresponding to the output variables
and the constants in a. Hence, the decision class of e is d(e) = and 

$E^i = \{e \in E \mid d(e) = i\}$ contains the examples with class $i$.

7 A Greedy Algorithm for Finding a Set of Clauses 

Finding a reduct is NP-hard [13], and this is a subproblem of finding a set of 
rules. Since a (propositional) decision system can be mapped into our first-order 
decision system, finding a set of clauses must be NP-hard, as well. Thus we have 
to use an approximation algorithm. 

A greedy algorithm for finding a set of clauses is given in Algorithm 1. It 
maintains an atom set B and a body set BS and iteratively adds the atom 
that discerns the most examples, to the atom set. The body set is extended 
correspondingly. Since an atom set can be infinite, a maximum depth bound k 
is set on the atoms. 
Algorithm 1 A greedy algorithm for finding a set of clauses
Input: A first-order decision system $(B, E, M, M_h)$, and a max. depth $k$.

Output: A set of clauses $H$.

1: $h = p(X_1, \ldots, X_n)$ is defined by $M_h$ and the $X_i$ are distinct variables
2: $R = \{(e_i, e_j) \in E \times E \mid d(e_i) \neq d(e_j)\}$, $B = \emptyset$, $BS = \emptyset$

3: while $R \neq \emptyset$ do

4: $\forall q \in \rho(B)$: if $Depth(B \cup \{q\}, h) \le k$ then compute $R(q) = R \cap Dis(BS_q)$ where
$BS_q = \{q \wedge b, \neg q \wedge b \mid b \in BS$ and $(\neg)q \wedge b$ is admissible$\}$

5: pick the $p$ with the highest $|R(p)|$

6: if $R(p) = \emptyset$ then break the while loop

7: $BS = \{b \in BS_p \mid b$ has the highest accuracy among the bodies covering $e \in E\}$

8: $B = B \cup \{p\}$, $R = R \setminus Dis(BS)$

9: $H = \{head(b) \leftarrow b \mid b \in BS\}$ where $head(b) = \bigvee \{h(d(e)) \mid e \in Cov(b)\}$ and
$h(x) = h$ if $x = +$, $h(x) = \neg h$ if $x = -$, and $h(x) = h\theta$ if $x$ is a substitution $\theta$



Definition 9. (Depth) Let $h$ be a target atom. The depth of variable $v$ in atom
set $A$, $Depth(v, A, h)$, is

1. $0$, if $v$ occurs in $h$,

2. $\max_{(u,t) \in In(a)} Depth(u, C, h) + 1$, if $(v, t') \in Out(a)$ and $In(a) \subseteq (Out(C) \cup
In(h))$ for some subset $C$ of $A$,

3. $\infty$, otherwise.

The depth of atom set $A$ is the maximum depth of its variables, $Depth(A, h) =
\max_{(v,t) \in Var(A)} Depth(v, A, h)$ where $Var(A) = In(A) \cup Out(A)$.

For example, $Depth(\{q(X, Y), q(Y, Z)\}, p(X)) = 2$ and variable $Y$ has depth 1,
given the following mode declarations: $modeh(p(+t))$ and $modeb(1, q(+t, -t))$.
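
The depth computation can be sketched as a small fixed-point iteration. This is an illustration under assumptions: atoms are encoded as pairs `(inputs, outputs)` of variable names, an output variable receives one more than the largest depth among the atom's inputs, and when several atoms could introduce a variable the smallest resulting depth is taken. The names `variable_depths` and `atom_set_depth` are hypothetical.

```python
import math

def variable_depths(head_vars, atoms):
    # atoms: list of (inputs, outputs), each a tuple of variable names.
    depth = {v: 0 for v in head_vars}       # case 1: variables of the head
    changed = True
    while changed:
        changed = False
        for ins, outs in atoms:
            if any(u not in depth for u in ins):
                continue                     # inputs not yet reachable
            d = max((depth[u] for u in ins), default=0) + 1
            for v in outs:
                if d < depth.get(v, math.inf):
                    depth[v] = d             # case 2: reached through this atom
                    changed = True
    return depth

def atom_set_depth(head_vars, atoms):
    depth = variable_depths(head_vars, atoms)
    variables = {v for ins, outs in atoms for v in (*ins, *outs)}
    # Unreachable variables count as infinitely deep (case 3).
    return max((depth.get(v, math.inf) for v in variables), default=0)

# q(X, Y), q(Y, Z) with head p(X), using modeb(1, q(+t, -t)).
atoms = [(("X",), ("Y",)), (("Y",), ("Z",))]
print(variable_depths(["X"], atoms), atom_set_depth(["X"], atoms))
```

For the example in the text this gives depth 1 for Y and depth 2 for the atom set.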

An example $e_i$ is discerned from example $e_j$ if there is no body which is
satisfied by both or there is some body that is satisfied only by $e_i$ and examples
of the same decision class as $e_i$, but not $e_j$. More formally, $(e_i, e_j) \in Dis(BS)$
iff $d(e_i) \neq d(e_j)$ and

1. $NS_{BS}(e_i) \cap NS_{BS}(e_j) = \emptyset$, or

2. there is a $Cov(c) \in NS_{BS}(e_i) \setminus NS_{BS}(e_j)$ such that $Cov(c) \subseteq E^{d(e_i)}$.
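
The two conditions can be turned into a direct check on a neighborhood system. The sketch below follows the prose description (no shared coverage, or a coverage around the first example that avoids the second and stays inside the first example's decision class). The function name `discernible_pairs` and the data layout are illustrative, and the neighborhoods are the ones listed in Example 1.

```python
def discernible_pairs(ns, decision):
    # ns: example -> list of coverages (frozensets of examples)
    # decision: example -> decision class
    classes = {}
    for e, d in decision.items():
        classes.setdefault(d, set()).add(e)
    dis = set()
    for ei in ns:
        for ej in ns:
            if decision[ei] == decision[ej]:
                continue
            ni, nj = set(ns[ei]), set(ns[ej])
            no_shared_body = not (ni & nj)                      # condition 1
            pure_private_body = any(c <= classes[decision[ei]]  # condition 2
                                    for c in ni - nj)
            if no_shared_body or pure_private_body:
                dis.add((ei, ej))
    return dis

fs = frozenset
ns = {
    "e1": [fs({"e1", "e2"}), fs({"e1", "e3"})],
    "e2": [fs({"e1", "e2"}), fs({"e2", "e3"})],
    "e3": [fs({"e1", "e3"}), fs({"e2", "e3"})],
    "e4": [fs({"e4", "e5"})],
    "e5": [fs({"e4", "e5"})],
}
decision = {"e1": "+", "e2": "+", "e3": "-", "e4": "-", "e5": "-"}
dis = discernible_pairs(ns, decision)
print(("e1", "e3") in dis, ("e3", "e1") in dis)
```

Note that the relation is not symmetric under this reading: e1 is discerned from e3 through the purely positive coverage {e1, e2}, while e3 has no coverage that stays inside the negative class.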

To compute the discernibility $Dis(BS(B))$ of an atom set $B$, the coverage of each
body in the body set $BS(B)$ has to be found. However, the body set contains at
worst $2^{|B|}$ bodies, and is not bounded by the examples since the same example
may satisfy several bodies. Thus the body set (BS) is reduced in line 7 such that
only the best body for each example is kept.

In RS, we have an initial set of condition attributes A and the goal is to find a 
subset B such that they have the same positive region. However, in a first-order 
decision system there are no initial atom sets like $A$. So, instead all examples are
initially assumed indiscernible and the algorithm terminates when all examples
have been discerned ($R = \emptyset$), or $R$ cannot be reduced any further ($R(p) = \emptyset$ for
all $p \in \rho(B)$). In this latter case the examples that cannot be discerned by the
corresponding body set are assumed indiscernible. Further details can be found 
8 Conclusion and Future Work

We have given a framework for ILP based on Rough Set Theory, and a greedy 
algorithm for finding a set of clauses. Our approach should be useful when the 
examples are indistinguishable due to sparse or missing background knowledge. 
In the future, we will investigate applications of this framework. We will also 
pursue other approaches to rough inductive logic programming. Especially, find- 
ing a reduct seems like an interesting problem. This is not possible with the 
algorithm presented here since the body set is reduced greedily. This means that 
there may be other bodies that discern better between the examples than those 
in the body set found by the algorithm. So even if we create an initial atom
set (e.g. the set of all atoms with depth $\le k$), we cannot guarantee that the
positive regions of this set and the set found by the algorithm will be exactly
identical (though in practice they will be close). However, we believe that finding
a reduct is achievable if another representation is chosen for the body set, and
we are currently investigating this.
Abstract. In the paper, infinite information systems are investigated
which are used in pattern recognition, discrete optimization, and
computational geometry. The depth and the size of deterministic and
nondeterministic decision trees over such information systems are studied. A
partition of the set of all infinite information systems into two classes is
considered. Systems from the first class are close to the best from the
point of view of deterministic and nondeterministic decision tree time
and space complexity. Decision trees for systems from the second class
have in the worst case large time or space complexity. In proofs (which
are too long and not included in the paper) methods of test theory [1,
2] and rough set theory [8, 9] are used.



1 Introduction 

Decision rules and deterministic decision trees are widely used in different appli- 
cations for problem solving and for knowledge representation. One can interpret 
a complete decision rule system (a system which is applicable for any input) as 
a nondeterministic decision tree. In this paper both deterministic and nondeter- 
ministic decision trees are studied. 

In rough set theory, finite information systems are usually considered.
However, the notion of an infinite information system is helpful in pattern recognition,
discrete optimization, and computational geometry [3, 6]. In the paper, arbitrary
infinite information systems are considered.

Rough set theory allows one to describe problems of different nature such that
the solution of the problem may be obtained exactly or approximately if we know
the values of attributes from a finite set which forms the description of a problem.
The efficiency of decision trees for solving such problems depends on their time
and space complexity. From [4, 5, 7] and from the results of this paper it follows that
for an arbitrary infinite information system in the worst case

- the minimal depth of a deterministic decision tree (as a function of the
number of attributes in a problem description) either is bounded from below by
a logarithm and from above by a logarithm to the power $1 + \varepsilon$, where $\varepsilon$ is an
arbitrary positive constant, or grows linearly;

- the minimal depth of a nondeterministic decision tree (as a function of the
number of attributes in a problem description) either is bounded from above by
a constant or grows linearly;

- the minimal number of nodes in deterministic and nondeterministic decision
trees (as a function of the number of attributes in a problem description) has
either polynomial or exponential growth.

In the paper a classification of infinite information systems is considered. All
such systems are divided into two classes.

For information systems from the first class 

- there exist deterministic decision trees whose depth grows almost as loga- 
rithm, and the number of nodes grows almost as a polynomial on the number of 
attributes in a problem description; 

- there exist nondeterministic decision trees whose depth is bounded from 
above by a constant, and the number of nodes grows almost as a polynomial on 
the number of attributes in a problem description. 

The information systems from the first class are close to the best from the
point of view of decision tree use. From the above it follows that for these
systems the growth of the depth of decision trees is almost minimal, and the
growth of the number of nodes in decision trees is almost minimal too. When
the number of attributes in a problem description is relatively small, the
considered decision trees may be used in practice. Note also that for information
systems from the first class there exist complete decision rule systems for which
the length of rules is bounded from above by a constant and the number of
rules grows almost as a polynomial in the number of attributes in a problem
description.

For an arbitrary information system from the second class in the worst case 

- the minimal depth of deterministic decision tree (as a function on the num- 
ber of attributes in a problem description) grows linearly; 

- nondeterministic decision trees have at least linear growth of the depth 
or have at least exponential growth of the number of nodes (depending on the 
number of attributes in a problem description) . 



2 Basic Notions 

Let $A$ be a nonempty set, $B$ be a finite nonempty set with at least two elements,
and $F$ be a nonempty set of functions from $A$ to $B$. Functions from $F$ will
be called attributes and the triple $U = (A, B, F)$ will be called an information
system.

The set $A$ may be interpreted as the set of inputs for problems over the
information system $U$. A problem over $U$ is an arbitrary $(n + 1)$-tuple $z =
(\nu, f_1, \ldots, f_n)$ where $\nu : B^n \to N$, $N$ is the set of natural numbers, and $f_1, \ldots, f_n \in
F$. The number $\dim z = n$ will be called the dimension of the problem $z$.
The problem $z$ may be interpreted as a problem of searching for the value
$z(a) = \nu(f_1(a), \ldots, f_n(a))$ for an arbitrary $a \in A$. Different problems of pattern
recognition, discrete optimization, fault diagnosis and computational geometry
can be represented in such form.
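
As a small illustration of the definition, a problem can be modelled as the composition of a function on attribute values with a tuple of attributes. The concrete system below (integers as inputs, two-valued threshold attributes, and the sum of attribute values as the outer function) is a hypothetical example chosen for this sketch and does not come from the paper.

```python
def make_problem(nu, attributes):
    # z = (nu, f_1, ..., f_n); z(a) = nu(f_1(a), ..., f_n(a)).
    def z(a):
        return nu(*(f(a) for f in attributes))
    return z

def threshold(t):
    # Hypothetical attribute from the integers to B = {0, 1}.
    return lambda a: 1 if a >= t else 0

# A problem of dimension 2: count how many thresholds the input reaches.
z = make_problem(lambda x, y: x + y, [threshold(0), threshold(10)])
print(z(-5), z(3), z(12))
```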

As algorithms for problem solving we will consider decision trees. A decision
tree over $U$ is a marked finite tree with a root in which the root and the
edges starting in the root are assigned nothing; each terminal node is assigned
a number from $N$; each node which is neither the root nor terminal (such nodes
are called working) is assigned an attribute from $F$; each edge is assigned an
element from $B$. A decision tree is called deterministic if the root is the initial node
of exactly one edge, and edges starting in a working node are assigned pairwise
different elements.

Let $\Gamma$ be a decision tree over $U$. A complete path in $\Gamma$ is an arbitrary sequence
$\xi = v_0, d_0, v_1, d_1, \ldots, v_m, d_m, v_{m+1}$ of nodes and edges of $\Gamma$ such that $v_0$ is the
root, $v_{m+1}$ is a terminal node, and $v_i$ is the initial and $v_{i+1}$ is the terminal
node of the edge $d_i$ for $i = 0, \ldots, m$. Now we define a subset $A(\xi)$ of the set
$A$ associated with $\xi$. If $m = 0$ then $A(\xi) = A$. Let $m > 0$, the attribute $f_i$ be
assigned to the node $v_i$ and $\delta_i$ be the element from $B$ assigned to the edge $d_i$,
$i = 1, \ldots, m$. Then $A(\xi) = \{a : a \in A, f_1(a) = \delta_1, \ldots, f_m(a) = \delta_m\}$.

We will say that a decision tree $\Gamma$ over $U$ solves a problem $z$ over $U$
nondeterministically if for each $a \in A$ there exists a complete path $\xi$ in $\Gamma$ such that
$a \in A(\xi)$, and for each $a \in A$ and each complete path $\xi$ in $\Gamma$ such that $a \in A(\xi)$
the terminal node of the path $\xi$ is assigned the number $z(a)$.

We will say that a decision tree $\Gamma$ over $U$ solves a problem $z$ over $U$
deterministically if $\Gamma$ is a deterministic decision tree which solves the problem $z$
nondeterministically.

As time complexity measure we will consider the depth of a decision tree
which is the maximal number of working nodes in a complete path in the tree.
As space complexity measure we will consider the number of nodes in a decision
tree. We denote by $h(\Gamma)$ the depth of a decision tree $\Gamma$, and by $L(\Gamma)$ we denote
the number of nodes in $\Gamma$.
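
The two complexity measures can be made concrete on a small deterministic tree. The encoding below is an assumption made for this sketch: a working node is a pair (attribute index, dictionary from edge labels to subtrees), a terminal node is a plain integer, and the unlabelled root with its single outgoing edge is left implicit and only added when counting nodes. The attributes and the tree are hypothetical.

```python
def evaluate(tree, attributes, a):
    # Follow the unique complete path whose edge labels match a.
    node = tree
    while isinstance(node, tuple):
        index, children = node
        node = children[attributes[index](a)]
    return node

def depth(tree):
    # h: maximal number of working nodes on a complete path.
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(child) for child in tree[1].values())

def node_count(tree):
    # L: working nodes + terminal nodes + the implicit root.
    def below(node):
        if not isinstance(node, tuple):
            return 1
        return 1 + sum(below(child) for child in node[1].values())
    return 1 + below(tree)

# Hypothetical attributes with values in B = {0, 1}.
attributes = [lambda a: 1 if a >= 0 else 0, lambda a: 1 if a >= 10 else 0]
# Tree computing f_1(a) + f_2(a), skipping f_2 whenever f_1(a) = 0.
tree = (0, {0: 0, 1: (1, {0: 1, 1: 2})})
print(evaluate(tree, attributes, 12), depth(tree), node_count(tree))
```

The tree has two working nodes on its longest path and six nodes in total (the root, two working nodes and three terminal nodes).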

3 Classification 

Consider an information system $U = (A, B, F)$. If $F$ is an infinite set then $U$ is
called an infinite information system.

We will say that the information system $U$ has infinite independence
dimension (or, in short, infinite I-dimension) if the following condition holds: for each
$t \in N$ there exist attributes $f_1, \ldots, f_t \in F$ and two-element subsets $B_1, \ldots, B_t$
of the set $B$ such that for arbitrary $\delta_1 \in B_1, \ldots, \delta_t \in B_t$ the system of equations

$\{f_1(x) = \delta_1, \ldots, f_t(x) = \delta_t\}$ (1)

is compatible (has a solution) on the set $A$. If the considered condition does not
hold then we will say that the information system $U$ has finite I-dimension.
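
For a fixed tuple of attributes and two-element subsets, the condition behind the I-dimension can be tested on a finite sample of inputs. This is only a sketch: the sample stands in for the (possibly infinite) set A, so a negative answer on a sample proves nothing about A itself, and the function name and the example attributes are hypothetical.

```python
from itertools import product

def is_compatible_for_all(attributes, subsets, sample):
    # True if every choice delta_1 in B_1, ..., delta_t in B_t is realised
    # by some a in the sample, i.e. each system of the form (1) has a solution.
    realised = {tuple(f(a) for f in attributes) for a in sample}
    return all(combo in realised for combo in product(*subsets))

# Bit attributes on 0..7 are independent: all 8 value combinations occur.
bits = [lambda a, i=i: (a >> i) & 1 for i in range(3)]
# Threshold attributes are not: a < 2 and a >= 5 cannot hold together.
thresholds = [lambda a: int(a >= 2), lambda a: int(a >= 5)]

print(is_compatible_for_all(bits, [(0, 1)] * 3, range(8)))
print(is_compatible_for_all(thresholds, [(0, 1)] * 2, range(8)))
```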

Now we consider the condition of decomposition for the information system
$U$. Let $t \in N$. A nonempty subset $D$ of the set $A$ will be called a $(t, F)$-set if
$D$ coincides with the set of solutions on $A$ of a system of the kind (1), where
$f_1, \ldots, f_t \in F$ and $\delta_1, \ldots, \delta_t \in B$.

We will say that the information system $U$ satisfies the condition of
decomposition if there exist numbers $m, p \in N$ such that every $(m + 1, F)$-set is a union
of $p$ sets each of which is an $(m, F)$-set.
The considered classification divides the set of infinite information systems
into two classes: $C_1$ and $C_2$. The class $C_1$ consists of all infinite information systems
each of which has finite I-dimension and satisfies the condition of decomposition.
The class $C_2$ consists of all infinite information systems each of which has infinite
I-dimension or does not satisfy the condition of decomposition.

4 Bounds on Time and Space Complexity 

In the following theorem time and space complexity of deterministic decision 
trees are considered. 

Theorem 1. Let U = (A, B, F) be an infinite information system. Then the
following statements hold:

a) if $U \in C_1$ then for any $\varepsilon$, $0 < \varepsilon <$ [...], there
exists a constant $c \in \mathbb{N}$ such that for each problem z over U there
exists a decision tree F over U which solves the problem z deterministically
and for which $h(F) < c(\log_2 [\ldots]) + 1$ and $L(F) <$ [...], where
$n = \dim z$;

b) if $U \in C_1$ then there is no $n \in \mathbb{N}$ such that for each
problem z over U with $\dim z = n$ there exists a decision tree F over U which
solves the problem z deterministically and for which
$h(F) < \log_{|B|}(n + 1)$;

c) if $U \in C_2$ then there is no $n \in \mathbb{N}$ such that for each
problem z over U with $\dim z = n$ there exists a decision tree F over U which
solves the problem z deterministically and for which $h(F) < n$.

In the following theorem time and space complexity of nondeterministic de- 
cision trees are considered. 

Theorem 2. Let U = (A, B, F) be an infinite information system. Then the
following statements hold:

a) if $U \in C_1$ then for any $\varepsilon$, $0 < \varepsilon <$ [...], there
exist constants $c_1, c_2 \in \mathbb{N}$ such that for each problem z over U
there exists a decision tree F over U which solves the problem z
nondeterministically and for which $h(F) < c_1$ and
$L(F) < |B|^{c_2(\log_2 [\ldots])}$, where $n = \dim z$;

b) if $U \in C_2$ then there is no $n \in \mathbb{N}$ such that for each
problem z over U with $\dim z = n$ there exists a decision tree F over U which
solves the problem z nondeterministically and for which $h(F) < n$ and
$L(F) <$ [...].

So the class $C_1$ is very interesting from the point of view of different
applications. The following example characterizes both the wealth and the
boundedness of this class.

Example 3. Let $m, t \in \mathbb{N}$. We denote by Pol(m) the set of all
polynomials which have integer coefficients and depend on variables
$x_1, \ldots, x_m$. We denote by Pol(m, t) the set of all polynomials from
Pol(m) such that the degree of each polynomial is at most t. We define
information systems U(m) and U(m, t) as follows:
$U(m) = (\mathbb{R}^m, E, F(m))$ and $U(m, t) = (\mathbb{R}^m, E, F(m, t))$,
where $\mathbb{R}$ is the set of real numbers, $E = \{-1, 0, +1\}$,
$F(m) = \{\mathrm{sign}(p) : p \in Pol(m)\}$ and
$F(m, t) = \{\mathrm{sign}(p) : p \in Pol(m, t)\}$. One can prove that
$U(m) \in C_2$ and $U(m, t) \in C_1$. Note that the system U(m) has infinite
I-dimension.

5 Conclusion 

In the paper a classification of infinite information systems is considered. A
class of information systems is described whose members are close to the best
from the point of view of deterministic and nondeterministic decision tree time
and space complexity. This classification may be useful for the choice of
information systems for investigation of problems of pattern recognition,
discrete optimization, and computational geometry.

Abstract. Information systems (data tables) are often used to represent
experimental data [4], [5], [8]. In [16] it has been pointed out that the
notions of extension and restriction of an information system are crucial for
solving different classes of problems [6], [7], [10]-[11], [12]-[15]. The intent
of this paper is to present some properties of an information system extension
(restriction) and methods of their verification.



Keywords: information systems, indiscernibility relation, discernibility func- 
tion, minimal rules, information system extension. 

1 Introduction 

In rough set theory, information systems [5] and the rules extracted from
information systems are the most common form of representing knowledge. In
[16] it has been pointed out that the notions of extension and restriction of
an information system are crucial for solving different classes of problems,
among others: (i) the synthesis problem of concurrent systems specified by
information systems [10], [15], (ii) the problem of discovering concurrent data
models from experimental tables [13], (iii) the re-engineering problem for
cooperative information systems [12], [14], (iv) the real-time decision making
problem [6], [11], (v) the control design problem for discrete event systems
[7].

The main idea of an information system extension can be explained as follows:
A given information system S defines an extension S' of S created by adding to
S all new objects corresponding to known attribute values. If an extension S'
of S is consistent with all rules true in S (i.e., any object u of the
extension matching the left hand side of the rule also matches its right hand
side, and there is an object u matching the left hand side of the rule) and is
the largest (w.r.t. the number of objects) extension of S with that property,
then the system S' is called a maximal consistent extension of S.

Maximal consistent extensions are used in the design of concurrent systems
specified by information systems [10], [15]. If an information system S
specifies a concurrent system then the maximal consistent extension of S
represents the largest set of global states of the concurrent system consistent
with all rules true in S. The set of global states can include new states of
the given concurrent system consistent with all rules true in S. Moreover, if
we are interested in determining a minimal (w.r.t. the number of objects)
description of a given concurrent system then we can compute a minimal
consistent restriction of an information system which specifies that concurrent
system.

The rest of the paper is structured as follows. A brief presentation of the 
basic concepts underlying the rough set theory is given in Section 2. The basic 
definitions and procedures for computing extensions are presented in Section 3. 

2 Preliminaries of Rough Set Theory 

2.1 Information Systems 

An information system is a pair S = (U, A), where U is a non-empty, finite set
called the universe, and A is a non-empty, finite set of attributes, i.e.,
$a : U \to V_a$ for $a \in A$, where $V_a$ is called the value set of a.
Elements of U are called objects. The set $V = \bigcup_{a \in A} V_a$ is said
to be the domain of A.

Example 1. Consider an information system S = (U, A) with
$U = \{u_1, u_2, u_3\}$, A = {a, b} and the values of the attributes defined as
in Table 1.

U/A    a    b
u_1    0    1
u_2    1    0
u_3    0    0

Table 1. An example of an information system

Let S = (U, A) be an information system and let $B \subseteq A$. A binary
relation ind(B), called an indiscernibility relation, is defined by
$ind(B) = \{(u, u') \in U \times U : \text{for every } a \in B,\ a(u) = a(u')\}$.
Any information system S = (U, A) determines an information function
$Inf_A : U \to P(A \times V)$ defined by $Inf_A(u) = \{(a, a(u)) : a \in A\}$,
where $P(A \times V)$ denotes the powerset of $A \times V$. The set
$\{Inf_A(u) : u \in U\}$ is denoted by INF(S). The values of an information
function will be sometimes represented by vectors of the form
$(v_1, \ldots, v_m)$, $v_i \in V_{a_i}$ for i = 1, ..., m, where m = card(A).
Such vectors are called information vectors (over V and A).

If S = (U, A) is an information system then the descriptors of S are
expressions of the form (a, v) where $a \in A$ and $v \in V_a$. Instead of
(a, v) we also write a = v or $a_v$. If $\tau$ is a Boolean combination of
descriptors then by $\|\tau\|$ we denote the meaning of $\tau$ in the
information system S.
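
The notions of this subsection can be sketched in a few lines of Python. The dictionary layout and the helper names (`inf`, `ind_classes`, `information_vectors`) are assumptions chosen for this illustration; only the data of Table 1 comes from the text.

```python
# Table 1 written as a dict: object name -> {attribute: value}.
TABLE_1 = {
    "u1": {"a": 0, "b": 1},
    "u2": {"a": 1, "b": 0},
    "u3": {"a": 0, "b": 0},
}


def inf(system, u, attrs):
    """Inf_B(u): the (attribute, value) pairs of u restricted to attrs."""
    return frozenset((a, system[u][a]) for a in attrs)


def ind_classes(system, attrs):
    """Equivalence classes of ind(B): objects agreeing on every a in B."""
    classes = {}
    for u in system:
        classes.setdefault(inf(system, u, attrs), []).append(u)
    return sorted(sorted(c) for c in classes.values())


def information_vectors(system, attrs):
    """INF(S) written as value tuples ordered like attrs."""
    return {tuple(system[u][a] for a in attrs) for u in system}


print(ind_classes(TABLE_1, ["a"]))
print(information_vectors(TABLE_1, ["a", "b"]))
```

With B = {a} the objects u1 and u3 fall into the same class, while with B = A all three objects are discernible.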

2.2 Rules in Information Systems 

Rules express some of the relationships between values of the attributes
described in the information systems.

Let S = (U, A) be an information system and let V be the domain of A. A rule
over A and V is any expression of the following form:

(1) $a_{i_1} = v_{i_1} \wedge \ldots \wedge a_{i_r} = v_{i_r} \Rightarrow a_p = v_p$

where $a_p, a_{i_j} \in A$, $v_p \in V_{a_p}$, $v_{i_j} \in V_{a_{i_j}}$ for
j = 1, ..., r.

A rule of the form (1) is called trivial if $a_p = v_p$ appears also on the
left hand side of the rule. The rule (1) is true in S if
$\emptyset \neq \| a_{i_1} = v_{i_1} \wedge \ldots \wedge a_{i_r} = v_{i_r} \| \subseteq \| a_p = v_p \|$.

The fact that the rule (1) is true in S is denoted in the following way:
$a_{i_1} = v_{i_1} \wedge \ldots \wedge a_{i_r} = v_{i_r} \Rightarrow_S a_p = v_p$.
By D(S) we denote the set of all rules true in S. Let $R \subseteq D(S)$. An
information vector $v = (v_1, \ldots, v_m)$ is consistent with R iff for any
rule $a_{i_1} = v'_{i_1} \wedge \ldots \wedge a_{i_r} = v'_{i_r} \Rightarrow_S a_p = v'_p$
in R, if $v_{i_j} = v'_{i_j}$ for j = 1, ..., r then $v_p = v'_p$.
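
A minimal sketch of these two checks, assuming that a rule is stored as a pair of dictionaries (left hand side, right hand side) and that an information vector is a dictionary from attributes to values. The representation and the function names are illustrative assumptions; the data is that of Table 1.

```python
TABLE_1 = {
    "u1": {"a": 0, "b": 1},
    "u2": {"a": 1, "b": 0},
    "u3": {"a": 0, "b": 0},
}


def matches(row, descriptors):
    """True if the row takes the required value on every listed attribute."""
    return all(row[a] == v for a, v in descriptors.items())


def is_true_in(system, lhs, rhs):
    """A rule is true in S iff some object matches its left hand side and
    every object matching the left hand side also matches the right one."""
    matching = [row for row in system.values() if matches(row, lhs)]
    return bool(matching) and all(matches(row, rhs) for row in matching)


def is_consistent(vector, rules):
    """A vector is consistent with the rules iff it satisfies the right hand
    side of every rule whose left hand side it matches."""
    return all(matches(vector, rhs)
               for lhs, rhs in rules if matches(vector, lhs))


rules = [({"a": 1}, {"b": 0}), ({"b": 1}, {"a": 0})]
print([is_true_in(TABLE_1, lhs, rhs) for lhs, rhs in rules])
print(is_consistent({"a": 1, "b": 1}, rules))
```

A rule whose left hand side is matched by no object, such as a = 2 implies b = 0, is not true in S under this reading, which mirrors the non-emptiness requirement above.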

2.3 Discernibility Matrix 

The discernibility matrix and the discernibility function [9] help to compute 
minimal forms of rules w.r.t. the number of attributes on the left hand side of 
the rules. 

Let S = (U, A) be an information system and let $U = \{u_1, \ldots, u_n\}$,
$A = \{a_1, \ldots, a_m\}$. By M(S) we denote an n x n matrix $(c_{ij})$,
called the discernibility matrix of S, such that
$c_{ij} = \{a \in A : a(u_i) \neq a(u_j)\}$ for i, j = 1, ..., n.

A discernibility function $f_{M(S)}$ for an information system S is a Boolean
function of m propositional variables $a_1^*, \ldots, a_m^*$ (where
$a_i \in A$ for i = 1, ..., m) defined as the conjunction of all expressions
$\bigvee c_{ij}^*$, where $\bigvee c_{ij}^*$ is the disjunction of all elements
of $c_{ij}^* = \{a^* : a \in c_{ij}\}$ for $1 \le j < i \le n$ and
$c_{ij} \neq \emptyset$. In the sequel we write a instead of $a^*$.
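
The following sketch builds the discernibility matrix of Table 1 and finds the prime implicants of the discernibility function by brute force. It relies on the fact that, for a conjunction of disjunctions of unnegated variables, the prime implicants are exactly the minimal attribute sets that intersect every non-empty clause. The function names are assumptions, and the search is exponential in the number of attributes, so it is meant only for small examples.

```python
from itertools import combinations

TABLE_1 = {
    "u1": {"a": 0, "b": 1},
    "u2": {"a": 1, "b": 0},
    "u3": {"a": 0, "b": 0},
}


def discernibility_matrix(system, attrs):
    """c_ij = set of attributes on which objects i and j differ."""
    return {(i, j): {a for a in attrs if system[i][a] != system[j][a]}
            for i in system for j in system}


def prime_implicants(clauses, attrs):
    """Minimal attribute sets hitting every non-empty clause."""
    clauses = [c for c in clauses if c]
    found = []
    for size in range(len(attrs) + 1):
        for subset in combinations(attrs, size):
            s = set(subset)
            hits_all = all(s & c for c in clauses)
            if hits_all and not any(p <= s for p in found):
                found.append(s)
    return found


matrix = discernibility_matrix(TABLE_1, ["a", "b"])
print(prime_implicants(matrix.values(), ["a", "b"]))
```

For Table 1 the function is (a or b) and b and a, so its only prime implicant uses both attributes. When every clause is empty, the empty set is returned, which corresponds to a function that is always true.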

2.4 Minimal Rules in Information Systems 

Now we recall a method for generating the minimal (i.e., with minimal left hand 
sides) form of rules in information systems [10], [15]. The method is based on the 
idea of Boolean reasoning [1] applied to discernibility matrices defined in [9]. 

Let S = (U, A) be an information system and $B \subset A$. For every
$a \notin B$ we define a function $d_a^B : U \to P(V_a)$ such that
$d_a^B(u) = \{v \in V_a : \text{there exists } u' \in U,\ u'\ ind(B)\ u \text{ and } a(u') = v\}$,
where $P(V_a)$ denotes the powerset of $V_a$.

Let S = (U, A) be an information system. We are looking for all minimal rules
in S of the form:
$a_{i_1} = v_{i_1} \wedge \ldots \wedge a_{i_r} = v_{i_r} \Rightarrow a = v$,
where $a \in A$, $v \in V_a$, $a_{i_j} \in A$ and $v_{i_j} \in V_{a_{i_j}}$ for
j = 1, ..., r.

The above rules express functional dependencies between the values of the
attributes of S. These rules are computed from systems of the form
$S' = (U, B \cup \{a\})$ where $B \subset A$ and $a \in A - B$.

First, for every $v \in V_a$ and $u_l \in U$ such that $d_a^B(u_l) = \{v\}$, a
modification $M(S'; a, v, u_l)$ of the discernibility matrix is computed from
$M(S')$.

By $M(S'; a, v, u_l) = (c'_{ij})$ (or M, in short) we denote the matrix
obtained from $M(S')$ in the following way:

if $i \neq l$ then $c'_{ij} = \emptyset$;
if $c_{lj} \neq \emptyset$ and $d_a^B(u_j) \neq \{v\}$ then
$c'_{lj} = c_{lj} \cap B$
else $c'_{lj} = \emptyset$.

Next, we compute the discernibility function $f_M$ and the prime implicants
[17] of $f_M$ taking into account the non-empty entries of the matrix M (when
all entries are empty we assume $f_M$ to be always true).

Finally, every prime implicant $a_{i_1} \wedge \ldots \wedge a_{i_r}$ of $f_M$
determines a rule
$a_{i_1} = v_{i_1} \wedge \ldots \wedge a_{i_r} = v_{i_r} \Rightarrow_S a = v$,
where $a_{i_j}(u_l) = v_{i_j}$ for j = 1, ..., r and $a(u_l) = v$.

The set of all rules constructed in the above way for any $a \in A$ is denoted
by OPT(S, a). We put $OPT(S) = \bigcup \{OPT(S, a) : a \in A\}$.

We compute all minimal rules true in $S' = (U, B \cup \{a\})$ of the form
$\tau \Rightarrow a = v$, where $\tau$ is a term in disjunctive form over B and
$V_B = \bigcup_{a \in B} V_a$ with a minimal number of descriptors in any
disjunct. To obtain all possible functional dependencies between the attribute
values it is necessary to repeat this process for all possible values of a and
for all remaining attributes from A.

3 Extensions and Restrictions of Information Systems 

Let S = (U, A) be an information system. A system S' = (U', A') such that
$U \subseteq U'$, $A' = \{a' : a \in A\}$, $a'(u) = a(u)$ for $u \in U$ and
$V_{a'} = V_a$ for $a \in A$ will be called an extension of S. S is then called
a restriction of S'.

The number of extensions of a given information system is determined by the
following proposition.

Proposition 1. Let S = (U, A) be an information system, k = card(U),
$n = card(V_{a_1} \times \ldots \times V_{a_l})$ where $a_j \in A$ for
j = 1, ..., l and l = card(A). Then the number of extensions of S is equal to
$2^{n-k} - 1$.

It follows from the following formula:
$C_{n-k}^1 + C_{n-k}^2 + \ldots + C_{n-k}^{n-k} = 2^{n-k} - 1$, where $C_j^i$
denotes the number of i-element combinations of a set with j elements.

Let S = (U, A) be an information system and let U'' denote the set of objects
corresponding to all admissible global states of S which do not appear in S,
i.e., U'' equals the difference between the cartesian product of the value sets
for all attributes $a \in A$ and those from INF(S). We say that an information
system S' = (U', A) is a maximal extension of S iff $U' = U \cup U''$.

Proposition 2. Let S = (U, A) be an information system. There exists only one
maximal extension S' of S.
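
The maximal extension can be sketched directly from its definition as a set difference against the cartesian product of the value sets. The function names are assumptions. The counting helper additionally assumes that every extension different from S is obtained by adding a non-empty subset of the missing global states, which gives 2^(n-k) - 1 such extensions.

```python
from itertools import product


def maximal_extension(vectors, value_sets):
    """Return (U'', all admissible global states) for the given rows."""
    all_states = set(product(*value_sets))
    new_states = all_states - set(vectors)
    return sorted(new_states), sorted(all_states)


def count_proper_extensions(vectors, value_sets):
    """Number of extensions other than S itself, under the assumption that
    each one adds a non-empty subset of the missing states."""
    n = 1
    for values in value_sets:
        n *= len(values)
    k = len(set(vectors))
    return 2 ** (n - k) - 1


# Rows of Table 1 over attributes (a, b) with value sets {0, 1}.
S = [(0, 1), (1, 0), (0, 0)]
new_states, all_states = maximal_extension(S, [(0, 1), (0, 1)])
print(new_states, count_proper_extensions(S, [(0, 1), (0, 1)]))
```

For Table 1 only the state (1, 1) is missing, so the maximal extension adds a single object.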

Let S = (U, A) be an information system. An information system S' = (U', A)
is called a minimal restriction of S iff S' is a restriction of S and any
restriction S'' of S is an extension of S'.

Proposition 3. A given information system S has at least one minimal
restriction S' of S (different from S).

An implicant of a Boolean function f is any conjunction of literals (variables
or their negations) such that if the values of these literals are true under an
arbitrary valuation v of variables then the value of the function f under v is
also true. A prime implicant is a minimal implicant. Here we are interested in
implicants of monotone Boolean functions only, i.e. functions constructed
without negation.

Example 2. Consider the information system S from Example 1. The system $S_1$
represented by Table 2 is its extension, but the system $S_2$ obtained from
$S_1$ by deleting the object [...] is not. $S_1$ is the maximal extension of S.
However, the system obtained from $S_1$ by deleting objects $u_3$ and $u_4$ is
the minimal restriction of S.

U_1/A  a    b
u_1    0    1
u_2    1    0
u_3    0    0
u_4    1    1

Table 2. The information system $S_1$

3.1 Maximal Consistent Extensions of Information Systems 

The notion of a maximal consistent extension of a given information system has 
been introduced in [10]. In this subsection, we present some properties of this 
notion and procedures for computing the maximal consistent extension. 

Let S' = (U', A') be an extension of S = (U, A). We say that S' is a consistent
extension of S iff $D(S) \subseteq D(S')$. S' is called a maximal consistent
extension of S iff S' is a consistent extension of S and any consistent
extension S'' of S is a restriction of S'.

The above definition implies the following.

Proposition 4. Let S = (U, A) be an information system. There exists only one
maximal consistent extension S' of S.

PROCEDURE for computing the maximal consistent extension S' of S:

Input: An information system S = (U, A) and the set OPT(S) of all rules
constructed as in subsection 2.4 for S.

Output: The maximal consistent extension S' of S.

Step 1. Compute all admissible global states of S which do not appear in S.
Step 2. Verify (using the set OPT(S) of rules) which global states of S
obtained in Step 1 are consistent with the rules true in S.
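
The two steps can be sketched as follows. The sample rows are an assumption: they are the four rows that appear in the Boolean formula of Example 4 below, taken here to be the rows of Table 3. The rules are those of OPT(S) from Example 3, split into one rule per descriptor on the left hand side. All function and variable names are illustrative.

```python
from itertools import product


def consistent(vector, rules):
    """True if the vector satisfies every rule whose left side it matches."""
    return all(all(vector[a] == v for a, v in rhs.items())
               for lhs, rhs in rules
               if all(vector[a] == v for a, v in lhs.items()))


def maximal_consistent_extension(rows, value_sets, rules):
    attrs = list(value_sets)
    known = {tuple(row[a] for a in attrs) for row in rows}
    # Step 1: admissible global states that do not appear in S.
    candidates = [dict(zip(attrs, values))
                  for values in product(*(value_sets[a] for a in attrs))
                  if values not in known]
    # Step 2: keep the candidates consistent with the rules.
    return list(rows) + [c for c in candidates if consistent(c, rules)]


rows = [{"a": 0, "b": 1}, {"a": 1, "b": 0},
        {"a": 0, "b": 2}, {"a": 2, "b": 0}]
value_sets = {"a": [0, 1, 2], "b": [0, 1, 2]}
rules = [({"a": 1}, {"b": 0}), ({"a": 2}, {"b": 0}),
         ({"b": 1}, {"a": 0}), ({"b": 2}, {"a": 0})]

extension = maximal_consistent_extension(rows, value_sets, rules)
print(extension[len(rows):])
```

Under these assumptions the only new object has a = b = 0, in agreement with Example 3. The enumeration in Step 1 grows with the product of the value-set sizes, so the sketch is practical only for small systems.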

It is known that, in general, the set OPT(S) of all rules constructed as
described in subsection 2.4 can be of exponential size (w.r.t. the number of
attributes). Nevertheless, there are several methodologies that make it
possible to deal with this problem in practical applications (see e.g. [2],
[3], [4] pages 3-97).

Proposition 5. Let S = (U, A) be an information system and S' its maximal
consistent extension. The set OPT(S) of all minimal rules of S defined in
subsection 2.4 is empty iff S' is equal to the space of all possible values of
attributes from A.

In order to decide whether a given information system has a maximal consistent
extension different from the system itself, the following proposition can be
useful.

Proposition 6. Let S = (U, A) be an information system. If S' is the maximal
consistent extension of S different from S then for at least two attributes
$a, b \in A$, $card(V_a) > 2$ and $card(V_b) > 2$.

Proposition 6 constitutes a necessary condition for the existence of the maximal
consistent extension of an information system different from that system.

Proposition 7. Let S = (U, A) be an information system and S' its maximal
consistent extension. If there exists a combination of attribute values
$(v_{i_1}, \ldots, v_{i_n})$, where $v_{i_j} \in V_{a_j}$ for j = 1, 2, ..., n
and n = card(A), such that there exists no functional dependency between any
two attribute values from this combination in S, then S' is different from S.

Proposition 8. Let S = (U, A) be an information system and S' its maximal
consistent extension. S' is the information system in which all new added objects
to S have the property mentioned in Proposition 1.

Example 3. Consider an information system S represented by Table 3. Applying
to S the method for generating the minimal form of rules described in
subsection 2.4 we obtain the following set OPT(S) of rules:
$a_1 \vee a_2 \Rightarrow_S b_0$, $b_1 \vee b_2 \Rightarrow_S a_0$.

U/A    a    b
u_1    0    1
u_2    1    0
u_3    0    2
u_4    2    0

Table 3. An example of an information system S

After running the procedure for computing the maximal consistent extension of
S we obtain the system S' including all objects of the system S and a new
object $u_5$ such that $a(u_5) = b(u_5) = 0$.

PROCEDURE for finding a description of the maximal consistent extension S'
of S:

Input: An information system S = (U, A), the set OPT(S) of all minimal rules
of S defined in subsection 2.4, and V, the domain of A.

Output: A description of the maximal consistent extension S' of S in the form
of a Boolean formula constructed from descriptors over A and V.

Step 1. Rewrite each rule from OPT(S) to the form of a Boolean formula.

Step 2. Construct the conjunction of formulas obtained in Step 1.

Step 3. Compute prime implicants of the formula obtained in Step 2.

In order to find the description of all new elements of the maximal consistent
extension S' of S (i.e., those from outside of the set of objects U) it is
sufficient to execute the procedure presented below.

PROCEDURE for finding a description of all new elements of the maximal
consistent extension of S:

Input: As for the procedure presented above.

Output: A description of new elements of the maximal consistent extension S'
of S in the form of a Boolean formula constructed from descriptors over A and
V.

Step 1. Execute the procedure for finding a description of the maximal
consistent extension of S.

Step 2. Rewrite each row from the system S to the form of a Boolean formula
and take the disjunction of these formulas.

Step 3. Construct the negation of the formula obtained in Step 2.

Step 4. Construct the conjunction of formulas obtained in Steps 1 and 3.

Step 5. Compute prime implicants of the formula obtained in Step 4.

Example 4. Consider again the information system S and OPT(S) described in
Example 3. Now by applying to S the procedure for finding a description of its
maximal consistent extension we obtain the following Boolean formula:
$(\neg((a = 1) \vee (a = 2)) \vee (b = 0)) \wedge (\neg((b = 1) \vee (b = 2)) \vee (a = 0))$.
After simplifications (using the laws of Boolean algebra) we get the formula
$(a = 0) \vee (b = 0)$. This formula matches all objects of S' from Example 3.

In order to find a description of the new elements of the maximal consistent
extension of S represented by Table 3 we can perform the above procedure.

As a result we obtain the following Boolean formula:
$\neg(((a = 0) \wedge (b = 1)) \vee ((a = 1) \wedge (b = 0)) \vee ((a = 0) \wedge (b = 2)) \vee ((a = 2) \wedge (b = 0))) \wedge ((a = 0) \vee (b = 0))$.
After simplifications we get the formula $(a = 0) \wedge (b = 0)$. It matches
the object $u_5$ of the system S' from Example 3.
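
Both simplifications can be checked by brute force over the value sets, here assumed to be {0, 1, 2} for both attributes since these are the values occurring in the formulas. The function names are illustrative; the formulas themselves are transcribed from the example.

```python
from itertools import product

DOMAIN = [0, 1, 2]
# Rows of S as they appear in the second formula of the example.
ROWS = {(0, 1), (1, 0), (0, 2), (2, 0)}


def description(a, b):
    """Conjunction of the two rules of OPT(S) written as implications."""
    return ((not (a == 1 or a == 2) or b == 0)
            and (not (b == 1 or b == 2) or a == 0))


def simplified(a, b):
    return a == 0 or b == 0


def new_elements(a, b):
    """Negation of the rows of S, conjoined with the description of S'."""
    return (a, b) not in ROWS and simplified(a, b)


states = list(product(DOMAIN, DOMAIN))
print(all(description(a, b) == simplified(a, b) for a, b in states))
print([s for s in states if new_elements(*s)])
```

The check confirms that the description is equivalent to (a = 0) or (b = 0) on this domain, and that the only state satisfying the second formula is (0, 0).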

Let S = (U, A) be a restriction of S' = (U', A). We say that S is a consistent
restriction of S' iff $D(S) \subseteq D(S')$. S is a minimal consistent
restriction of S' iff S is a consistent restriction of S' and any consistent
restriction S'' of S' is an extension of S.

Remark 1. The information system S from Example 3 is the minimal consistent
restriction of the system S' considered in Example 3.

4 Concluding Remarks 

The extensions (restrictions) of information systems appear in many
investigations related to the rough set methods for solving different classes
of problems [16]. The presented approach can be treated as a constructive
method of extending an information system to the largest data table including
the same knowledge as the original information system. Maximal (minimal)
consistent extensions (restrictions) provide a basis for modeling concurrent
systems using rough set methods.

Abstract. The concept of valued tolerance is introduced as an extension of the
usual concept of indiscernibility (which is a crisp equivalence relation) in
rough sets theory. Some specific properties of the approach are discussed.
Further on the problem of inducing rules is addressed. Properties of a
“credibility degree” associated to each rule are analysed and its use in
classification problems is discussed.



1 Introduction 

Rough sets theory has been introduced by Pawlak to deal with a vague
description of objects. The starting point of this theory is an observation that
objects having the same description are indiscernible (similar) with respect to 
the available information. Although original rough sets theory has been used to 
face several problems, the use of the indiscernibility relation may be too rigid 
in some real situations. Therefore several generalisations of this theory have 
been proposed. Some of them ([1, 12]) extend the basic idea to a fuzzy context 
while others use more general similarity or tolerance relations instead of classical 
indiscernibility relation (see e.g. [8, 9]). There are also combinations of both ex- 
tensions by [2, 3] where lower and upper approximations are fuzzy sets based on 
a fuzzy similarity relation. Properties of extended binary relations were studied 
in [13]. 

In this paper we introduce the concept of valued tolerance relation as a new 
extension of rough sets theory. A functional extension of the concepts of upper 
and lower approximation is introduced so that to any subset of the universe a 
degree of lower (upper) approximability can be associated. In other terms, any 
subset of the universe can be lower (upper) approximation of a given set, but to 
a different degree. Further on, such a functional extension makes it possible
to compute a credibility degree for any rule induced from the input information
table. Such an idea first appeared in our previous work on incomplete
information tables [10].

The paper is organised as follows. In section 2, we discuss motivations for 
using valued tolerance relation. Then, in section 3 we introduce formally the 
concept of valued tolerance. Some specific properties of this approach are also 
examined. In section 4, problems of inducing decision rules and computing cred- 
ibility of rule are discussed. Results are summarised in section 5. 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 212-219, 2001. 

© Springer-Verlag Berlin Heidelberg 2001

2 Why Valued Tolerance?

Original rough sets theory is based on the concept of indiscernibility relation 
which is a crisp equivalence relation (complete, reflexive, symmetric and transi- 
tive relation valued in {0, 1}). Practically speaking, two objects, described by a 
set of attributes, are indiscernible iff they have identical values. However, real life 
suggests that this is a very strong assumption. Objects may be practically indis- 
cernible without having identical values. The idea of substituting indiscernibility 
with the concept of similarity has already been studied in e.g. [9,8]. Moreover, 
it could be the case that objects can be “more or less similar” depending on the 
particular information available. Consider the following two examples.

Example 1. Three objects $x_1, x_2, x_3$ and four attributes
$c_1, c_2, c_3, c_4$ are given, each attribute equipped with a discrete nominal
scale A, B, C, D. Besides, the above information table is provided, where *
represents the “unknown” value of an attribute.

If any similarity is to be considered among the three objects, it is easy to
suggest that “it is more possible that $x_2$ is similar to $x_1$ than $x_3$ to
$x_1$”. Conventional rough sets theory simply does not apply in this case and
its usual extensions handling unknown values will consider the three objects as
completely different or totally identical ([4, 6]). However, being able to give
a value to the possibility that objects are similar could open interesting
operational directions (see [10]).

Example 2. Three objects $x_1, x_2, x_3$ and four attributes
$c_1, c_2, c_3, c_4$ are given, each attribute equipped with an interval scale
in the interval [0,100]. Besides, the above information table is provided.

If any similarity is to be considered, the reader might agree that it is
reasonable to consider that “it is more possible that $x_2$ is similar to $x_1$
than $x_3$ to $x_1$”, while $x_2$ and $x_3$ are not similar at all. This is an
effect of the existence of a discrimination threshold. In such a model (see
[7]) objects are different only if they have a difference of more than the
established threshold (in the example such a threshold is 5). However, the
threshold by itself does not solve the problem, for the same discrimination
problem could be considered near the threshold: i.e. why is a difference of 4
not significant while a difference of 5 is? It is more natural to consider that
the possibility that two objects are similar decreases as the difference of
value of the two objects increases. The use of a valued tolerance appears to be
more appropriate in this case also.

From the above considerations the benefit of introducing a valued tolerance
when comparing objects is clear (to us), whatever the purpose of the comparison
is. Our aim is to introduce such a concept. In the following we will restrict
ourselves to symmetric (possibly valued) similarity relations which we denote
as (possibly valued) tolerance relations.

3 Rough Sets and Valued Tolerance

In the following we introduce an approach already discussed for handling
incomplete information tables in [10]. Given a valued tolerance relation on the
set A we can define a “tolerance class”, that is a fuzzy set with membership
function the “possibility of tolerance” to a reference object $x \in A$. The
problem is to define the concepts of upper and lower approximation of a set
$\Phi$. The approach we will adopt in this paper considers, coherently with the
rest, approximation as a continuous valuation. Given a set $\Phi$ to describe
and a set $Z \subseteq A$ we will try to define the degree by which Z
approximates from the top or from the bottom the set $\Phi$. Technically we
will try to give the functional correspondent of the concepts of lower and
upper approximation. Some researchers [1-3] have similar concerns and explored
the idea of combining fuzzy and rough sets, but under our perspective lower and
upper approximations are not fuzzy sets to which elements from the universe of
discourse may more or less belong. Each subset of A may be a lower or upper
approximation of $\Phi$, but to different degrees (such an approach has been
inspired by the work of Kitainik [5]).

For this purpose we need to translate in a functional representation the usual
logical connectives of negation, conjunction etc. (x, y, z, ... represent
membership degrees in the following). For this purpose we consider negation
functions N(x) : N(x) = 1 - x, T-norms T(x, y) = min(x, y) or xy or
max(x + y - 1, 0) and T-conorms S(x, y) = max(x, y) or x + y - xy or
min(x + y, 1). If S(x, y) = N(T(N(x), N(y))) we call the triplet (N, T, S) a
De Morgan triplet. I(x, y), the degree by which x may imply y, is again a
function that could satisfy the following (almost) incompatible properties:
I(x, y) = S(N(x), y) or $x \le y \Leftrightarrow I(x, y) = 1$.
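
The connectives listed above are easy to write down and test numerically. The sketch below checks on a small grid which pairs of T-norm and T-conorm form a De Morgan triplet with N(x) = 1 - x. The dictionary keys, the grid and the function names are assumptions made for this illustration.

```python
def neg(x):
    """Standard negation N(x) = 1 - x."""
    return 1.0 - x


T_NORMS = {
    "min": lambda x, y: min(x, y),
    "product": lambda x, y: x * y,
    "lukasiewicz": lambda x, y: max(x + y - 1.0, 0.0),
}
T_CONORMS = {
    "max": lambda x, y: max(x, y),
    "probabilistic_sum": lambda x, y: x + y - x * y,
    "bounded_sum": lambda x, y: min(x + y, 1.0),
}
GRID = [0.0, 0.25, 0.5, 0.75, 1.0]


def is_de_morgan(t, s, grid=GRID, tol=1e-9):
    """Check S(x, y) = N(T(N(x), N(y))) on every grid point."""
    return all(abs(s(x, y) - neg(t(neg(x), neg(y)))) <= tol
               for x in grid for y in grid)


def implication(s, x, y):
    """I(x, y) = S(N(x), y)."""
    return s(neg(x), y)


pairs = [("min", "max"), ("product", "probabilistic_sum"),
         ("lukasiewicz", "bounded_sum")]
print([is_de_morgan(T_NORMS[t], T_CONORMS[s]) for t, s in pairs])
print(implication(T_CONORMS["max"], 0.5, 0.5))
```

The last line illustrates why the two properties of I are called (almost) incompatible: with S = max, I(0.5, 0.5) = 0.5 although x <= y holds.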

Coming back to our lower and upper approximations we know that, given a set
$Z \subseteq A$, a subset of attributes $B \subseteq C$ and a set $\Phi$, the
usual definitions are:
$Z = \Phi_B \Leftrightarrow \forall z \in Z,\ \Theta_B(z) \subseteq \Phi$ and
$Z = \Phi^B \Leftrightarrow \forall z \in Z,\ \Theta_B(z) \cap \Phi \neq \emptyset$,
where $\Theta_B(z)$ is the tolerance class of element z created on the basis of
the subset of attributes B. The functional translation of such definitions is:
$\forall x\, \phi(x) =_{def} T_x \phi(x)$;
$\exists x\, \phi(x) =_{def} S_x \phi(x)$;
$\Phi \subseteq \Psi =_{def} T_x(I(\mu_\Phi(x), \mu_\Psi(x)))$;
$\Phi \cap \Psi \neq \emptyset =_{def} \exists x\, \phi(x) \wedge \psi(x) =_{def} S_x(T(\mu_\Phi(x), \mu_\Psi(x)))$.
We get:

$\mu_{\Phi_B}(Z) = T_{z \in Z}(T_{x \in \Theta_B(z)}(I(R_B(z, x), \hat{x})))$,
$\mu_{\Phi^B}(Z) = T_{z \in Z}(S_{x \in \Theta_B(z)}(T(R_B(z, x), \hat{x})))$

where: $\mu_{\Phi_B}(Z)$ is the degree for set Z to be a B-lower approximation
of $\Phi$; $\mu_{\Phi^B}(Z)$ is the degree for set Z to be a B-upper
approximation of $\Phi$; $\Theta_B(z)$ is the tolerance class of element z;
T, S, I are the functions previously defined; as far as I(x, y) is concerned we
will always choose to satisfy De Morgan law (I(x, y) = S(N(x), y)). This is due
to the particular case of $\mu_{\Phi_B}(Z)$ where $\hat{x} \in \{0, 1\}$. If we
choose any other representation then lower approximability collapses to
{0, 1}. $R_B(z, x)$ is the membership degree of element x in the tolerance
class of z (at the same time it is the valued tolerance relation between
elements x and z for attribute set B; in our case
$R_B(z, x) = T_{j \in B} R_j(z, x)$); $\hat{x}$ is the membership degree of
element x in the set $\Phi$ ($\hat{x} \in \{0, 1\}$).
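
A minimal sketch of the two degrees, assuming T = min, S = max and I(p, q) = S(1 - p, q). The inner aggregation runs over the whole universe rather than over the tolerance class; this is an assumption of the sketch, harmless here because an object with tolerance 0 contributes the neutral value to each aggregation. The relation `R` and the set `phi` are made-up illustrative data, not taken from the paper.

```python
from functools import reduce


def lower(Z, phi, R, universe, T=min, S=max):
    """Degree for Z to be a lower approximation of phi."""
    def xhat(x):
        return 1.0 if x in phi else 0.0
    per_z = [reduce(T, (S(1.0 - R[z][x], xhat(x)) for x in universe))
             for z in Z]
    return reduce(T, per_z)


def upper(Z, phi, R, universe, T=min, S=max):
    """Degree for Z to be an upper approximation of phi."""
    def xhat(x):
        return 1.0 if x in phi else 0.0
    per_z = [reduce(S, (T(R[z][x], xhat(x)) for x in universe))
             for z in Z]
    return reduce(T, per_z)


# Hypothetical symmetric, reflexive valued tolerance on three objects.
UNIVERSE = ["x1", "x2", "x3"]
R = {
    "x1": {"x1": 1.0, "x2": 0.5, "x3": 0.1},
    "x2": {"x1": 0.5, "x2": 1.0, "x3": 0.0},
    "x3": {"x1": 0.1, "x2": 0.0, "x3": 1.0},
}
phi = {"x1", "x2"}

print(lower(["x1"], phi, R, UNIVERSE), upper(["x1"], phi, R, UNIVERSE))
print(lower(["x3"], phi, R, UNIVERSE), upper(["x3"], phi, R, UNIVERSE))
```

For an element of phi the upper degree is 1 and the lower degree is reduced only by its tolerance to elements outside phi; for an element outside phi the lower degree is 0.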

In the following we provide some formal properties that such an approach
fulfils. The reader should note that in the following $\Phi^c$ denotes the
complement of set $\Phi$ with respect to the universe.

Proposition 1. If T, S, I fulfill the De Morgan law and $R_B$ is a valued
tolerance relation then $\forall Z \subseteq A$:
$\mu_{\Phi_B}(Z) \le \mu_{\Phi^B}(Z)$.

Proof: In order to demonstrate the proposition we observe that both the lower
and the upper approximability are T-norms on the same set Z. It is sufficient
now to demonstrate that $\forall z \in Z$ the argument of the T-norm defining
the lower approximability is less than or equal to the argument of the T-norm
defining the upper approximability. Thus we have to demonstrate that:
$T_{x \in \Theta_B(z)}(S(1 - R_B(z, x), \hat{x})) \le S_{x \in \Theta_B(z)}(T(R_B(z, x), \hat{x}))$.
Since min is the largest T-norm and max is the smallest T-conorm it is
sufficient to demonstrate that:
$\min_{x \in \Theta_B(z)}(\max(1 - R_B(z, x), \hat{x})) \le \max_{x \in \Theta_B(z)}(\min(R_B(z, x), \hat{x}))$.
We distinguish two cases:

1. Consider $z = x_k \in \Phi^c$. Then $\hat{x}_k = 0$. Therefore when
$x = x_k$ we have $\max(1 - R_B(x_k, x_k), \hat{x}_k) = 0$ so that
$\min_{x \in \Theta_B(z)}(\max(1 - R_B(z, x), \hat{x})) = 0$. At the same time:
$\forall x \in \Phi^c$, $\hat{x} = 0$ and $\min(R_B(z, x), \hat{x}) = 0$, and
$\forall x \in \Phi$, $\hat{x} = 1$ and
$\min(R_B(z, x), \hat{x}) = R_B(z, x)$, so that
$\max_{x \in \Theta_B(z)}(\min(R_B(z, x), \hat{x})) = \max_{x \in \Phi}(R_B(z, x)) \ge 0$.
Therefore if $z \in \Phi^c$ we get $\mu_{\Phi_B}(z) = 0 \le \mu_{\Phi^B}(z)$.

2. Consider $z = x_k \in \Phi$. Then $\hat{x}_k = 1$. Therefore when $x = x_k$
we have $\min(R_B(x_k, x_k), \hat{x}_k) = 1$ so that
$\max_{x \in \Theta_B(z)}(\min(R_B(z, x), \hat{x})) = 1$. At the same time:
$\forall x \in \Phi$, $\hat{x} = 1$ and $\max(1 - R_B(z, x), \hat{x}) = 1$, and
$\forall x \in \Phi^c$, $\hat{x} = 0$ and
$\max(1 - R_B(z, x), \hat{x}) = 1 - R_B(z, x) \le 1$, so that
$\min_{x \in \Theta_B(z)}(\max(1 - R_B(z, x), \hat{x})) = \min_{x \in \Phi^c}(1 - R_B(z, x)) \le 1$.
Therefore if $z \in \Phi$ we get $\mu_{\Phi_B}(z) \le 1 = \mu_{\Phi^B}(z)$.

We immediately obtain the following corollary.

Corollary 1.

If $z \in \Phi$ then
$\mu_{\Phi_B}(z) = \min_{x \in \Phi^c}(1 - R_B(z, x)) \le \mu_{\Phi^B}(z) = 1$.

If $z \in \Phi^c$ then
$\mu_{\Phi_B}(z) = 0 \le \mu_{\Phi^B}(z) = \max_{x \in \Phi}(R_B(z, x))$.

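Proposition 2. If T, S, I respect the De Morgan law and $R_B$ is a valued
tolerance relation then $\forall z$:
$\mu_{\Phi_B}(z) = 1 - \mu_{(\Phi^c)^B}(z)$.

Proof: Denote by $\hat{x}^c$ the membership of x to $\Phi^c$. Clearly
$\hat{x}^c = 1 - \hat{x}$. We then have:
$\mu_{\Phi_B}(z) = T_{x \in \Theta_B(z)}(S(1 - R_B(z, x), \hat{x})) = T_{x \in \Theta_B(z)}(S(1 - R_B(z, x), 1 - \hat{x}^c)) = T_{x \in \Theta_B(z)}(1 - T(R_B(z, x), \hat{x}^c)) = 1 - S_{x \in \Theta_B(z)}(T(R_B(z, x), \hat{x}^c)) = 1 - \mu_{(\Phi^c)^B}(z)$.
We immediately obtain the following corollary.

Corollary 2.

$\mu_{\Phi_B}(Z) = 1 - S_{z \in Z}(\mu_{(\Phi^c)^B}(z))$,
$\mu_{\Phi^B}(Z) = 1 - S_{z \in Z}(\mu_{(\Phi^c)_B}(z))$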

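Finally we can show the following result.

Proposition 3. $\forall Z \subseteq A$, $\forall B \subseteq \hat{B}$:
$\mu_{\Phi_B}(Z) \le \mu_{\Phi_{\hat{B}}}(Z)$.

Proof: Since $R_B(z, x) = T_{j \in B} R_j(z, x)$, if $B \subseteq \hat{B}$ then
$\forall x$ $R_B(z, x) \ge R_{\hat{B}}(z, x)$ and therefore $\forall x$
$1 - R_B(z, x) \le 1 - R_{\hat{B}}(z, x)$. Then by definition of lower
approximability the proposition holds.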

The consequence of the above results is that in order to compute the lower
(upper) approximability of any subset $Z \subseteq A$ it is sufficient to
compute the upper (lower) approximability of each single element of A of the
sets $\Phi$ and $\Phi^c$. Operationally we can fix a threshold k (l) for the
lower (upper) approximability and then add elements to the empty set by
decreasing order of their lower (upper) approximability. Consider the following
example.

Example 3. A set $A$ of 12 objects, four attributes $c_1, c_2, c_3, c_4$ and a
decision attribute $d$ are given, each attribute equipped with an interval scale in
the interval [0,100]. Besides, the following decision table is provided:

[Decision table not reproduced.]

A constant threshold of $k = 10$ applies in order to consider two objects as
surely not different (similar) on any of the four attributes. Following an approach
recently introduced [11] we consider for each attribute $c_j$ a valued tolerance
relation as follows:

$$R_j(x,y) = \frac{\max(0, \min(c_j(x), c_j(y)) + k - \max(c_j(x), c_j(y)))}{k}$$

where $k$ is the discrimination threshold. It is easy to observe that: $\forall x,y \in A$,
$R_j(x,y) = 1$ iff $c_j(x) = c_j(y)$; $\forall x,y \in A$, $R_j(x,y) \in \,]0,1[$ iff $0 < |c_j(x) - c_j(y)| < k$;
$\forall x,y \in A$, $R_j(x,y) = 0$ iff $|c_j(x) - c_j(y)| \ge k$. If $R_j(x,y)$ represents the
necessity that $x$ is similar to $y$ then, in presence of several attributes, a way to
evaluate the necessity that $x$ is comprehensively similar to $y$ is to take the T-
norm of the different similarities. We get: $R(x,y) = \min_j(R_j(x,y))$. Applying
this formula to Example 3 we get the comprehensive valued relation on the set
$A$. For instance, for object $x_1$: $R(x_1,x_1) = 1$, $R(x_1,x_2) = 0.5$, $R(x_1,x_3) = 0.1$ and
for the rest we have $R(x_1,y) = 0$. Using the above information we can compute
the lower and upper approximability for each element of set $A$. For instance,
$\mu_{\Phi_B}(x_1) = 0.9$, $\mu_{\Phi^B}(x_1) = 1$, $\mu_{(\Phi^c)_B}(x_1) = 0$ and $\mu_{(\Phi^c)^B}(x_1) = 0.1$.
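
The computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: min and max are used as T-norm and T-conorm, membership in $\Phi$ is crisp, and the three-object table (`TABLE`, `IN_PHI`) is hypothetical, since the decision table of Example 3 is not available here. All function names are ours.

```python
from typing import Dict, Sequence


def attribute_tolerance(a: float, b: float, k: float) -> float:
    """R_j(x, y) = max(0, min(a, b) + k - max(a, b)) / k for one attribute."""
    return max(0.0, min(a, b) + k - max(a, b)) / k


def tolerance(x: Sequence[float], y: Sequence[float], k: float) -> float:
    """Comprehensive relation R(x, y), using min as the T-norm."""
    return min(attribute_tolerance(a, b, k) for a, b in zip(x, y))


def lower_approximability(z: str, table: Dict[str, Sequence[float]],
                          in_phi: Dict[str, bool], k: float) -> float:
    # Objects with R(z, x) = 0 contribute max(1, .) = 1, the neutral element
    # of min, so ranging over the whole table equals ranging over Theta_B(z).
    return min(max(1.0 - tolerance(table[z], table[x], k), float(in_phi[x]))
               for x in table)


def upper_approximability(z: str, table: Dict[str, Sequence[float]],
                          in_phi: Dict[str, bool], k: float) -> float:
    # Objects with R(z, x) = 0 contribute min(0, .) = 0, neutral for max.
    return max(min(tolerance(table[z], table[x], k), float(in_phi[x]))
               for x in table)


# Hypothetical data (not the table of Example 3): two attributes, k = 10.
TABLE = {"a": (50, 50), "b": (55, 52), "c": (59, 50)}
IN_PHI = {"a": True, "b": True, "c": False}
K = 10.0

for z in TABLE:
    print(z, lower_approximability(z, TABLE, IN_PHI, K),
          upper_approximability(z, TABLE, IN_PHI, K))
```

With these made-up values object `a` plays a role similar to $x_1$ in the text: it is fully similar to itself, partly similar to another member of $\Phi$, and only weakly similar to an object outside $\Phi$.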



4 Decision Rules 

In order to induce classification rules from the decision table on hand we may ac-
cept now rules with a "credibility degree" derived from the fact that objects may
be similar to the conditional part of the rule only to a certain degree, besides the
fact the implication in the decision part is also uncertain. More formally we give
the following representation for a rule: $\rho_i =_{def} \bigwedge_{c_j \in B} (c_j = v) \rightarrow (d = \phi)$

where: $B \subseteq C$, $v$ is the value of conditional attribute $c_j$, $\phi$ is the value of deci-
sion attribute $d$. We may use the valued relation $s_B(x, \rho_i)$ in order to indicate
that element $x$ "supports" rule $\rho_i$ or that $x$ is similar to some extent to the
conditional part of rule $\rho_i$ on attributes $B$. The relation $s$ is a valued tolerance
relation defined exactly as relation $R$. We denote $S(\rho_i) = \{x : s_B(x, \rho_i) > 0\}$
and $\Phi = \{x : d(x) = \phi\}$. In the case of a crisp relation, $\rho_i$ is a classification
rule iff: $\forall x \in S(\rho_i) : \Theta_B(x) \subseteq \Phi$. Shifting to the valued case we can com-
pute a credibility degree for any rule $\rho_i$ by calculating the credibility of the pre-
vious formula, which can be rewritten as: $\forall x, y$: $s_B(x, \rho_i) \rightarrow (\mu_{\Theta_B(x)}(y) \rightarrow \mu_\Phi(y))$.

We get: $\mu(\rho_i) = T_{x \in S(\rho_i)}(I(s_B(x, \rho_i), T_{y \in \Theta_B(x)}(I(\mu_{\Theta_B(x)}(y), \mu_\Phi(y)))))$ where:
$\mu_{\Theta_B(x)}(y) = R_B(x,y)$ and $\mu_\Phi(y) \in \{0,1\}$.

Finally it is necessary to check whether $B$ is a non-redundant set of conditions
for rule $\rho_i$, i.e. to look if it is possible to satisfy the condition: $\exists \hat{B} \subset B : \mu(\rho_i^{\hat{B}}) \ge
\mu(\rho_i^{B})$. We can equivalently state that if there is no $\hat{B}$ satisfying the condition
then $B$ is a "non redundant" set of attributes for rule $\rho_i$. Before we continue
the presentation of our approach in rule induction, it is important to state the
following result.

Proposition 4. Consider a rule $\rho_i$ classifying objects to a set $\Phi \subseteq A$ under a set
of attributes $B$. If $T$, $S$, $I$ satisfy the De Morgan law and $R_B$ is a valued tolerance,
the credibility $\mu(\rho_i)$ of the rule is upper bounded by the lower approximability of
set $\Phi$ by the element $x_k$ whose description (under attributes $B$) coincides with
the conditional part of the rule.

Proof: Consider the definition of rule credibility. We have
$T_{y \in \Theta_B(x)}(I(\mu_{\Theta_B(x)}(y), \mu_\Phi(y))) = \mu_{\Phi_B}(x)$. Considering that $I(x,y) = S(N(x),y)$
and that $s_B = R_B$ we can rewrite: $\mu(\rho_i) = T_{x \in S(\rho_i)}(S(1 - R_B(x, \rho_i), \mu_{\Phi_B}(x)))$.

We distinguish four cases.

1) There exists an element $x_k \in A$ whose description (under attributes $B$) coincides
with the conditional part of rule $\rho_i$. We have $R_B(x_k, \rho_i) = 1$. Therefore
$S(1 - R_B(x_k, \rho_i), \mu_{\Phi_B}(x_k)) = \mu_{\Phi_B}(x_k)$ in this case.

2) For all $x$ for which $R_B(x, \rho_i) = 0$ we get $S(1 - R_B(x, \rho_i), \mu_{\Phi_B}(x)) = 1$.

3) For all $x$ for which $1 - R_B(x, \rho_i) > \mu_{\Phi_B}(x)$ we get
$S(1 - R_B(x, \rho_i), \mu_{\Phi_B}(x)) \ge 1 - R_B(x, \rho_i)$ since max is the smallest T-conorm.

4) For all $x$ for which $1 - R_B(x, \rho_i) \le \mu_{\Phi_B}(x)$ we get
$S(1 - R_B(x, \rho_i), \mu_{\Phi_B}(x)) \ge \mu_{\Phi_B}(x)$ since max is the smallest T-conorm.

Denoting $x_k, x_l, x_i, x_j$ the $x$ for the four cases respectively we obtain: $\mu(\rho_i) =
T_{x \in S(\rho_i)}(\mu_{\Phi_B}(x_k), \{\forall x_l : 1\}, \{\forall x_i : 1 - R_B(x_i, \rho_i)\}, \{\forall x_j : \mu_{\Phi_B}(x_j)\})$. Since by
definition $T(x,y) \le \min(x,y) \le x$, $\mu_{\Phi_B}(x_k)$ is an upper bound for $\mu(\rho_i)$.

Operationally, the user should fix a credibility threshold for the induced
rules in order to prevent proliferation of rules considered as "unsafe" for
classification purposes. A sensitivity analysis could be performed around such a
threshold to find accepted rules.

In general, elementary conditions of the induced rules are created using the
description of objects in the decision table. Assuming that the user has defined a
credibility threshold at level $\lambda$, it is possible to use the result of Proposition 4
to induce decision rules, i.e. when choosing objects as candidates for inducing
classification rules for class $\Phi$ it is sufficient to choose only objects with lower
approximability of the set $\Phi$ not worse than $\lambda$. Other objects can be skipped.
Further on it is necessary to search for the non-redundant sets of conditions (in
general this corresponds to the problem of looking for local reducts). Finally, given
a credibility threshold $\lambda$ one can relax the previous requirement of non-redundant
rules, i.e. it may be accepted that from rule $\rho_i$ with credibility $\mu(\rho_i)$ new rules
are generated with a shorter condition part and lower credibility, as long as
the credibility stays above the allowed threshold. Let us notice that the problem of inducing
all rules with accepted credibility from examples in the information table has
exponential complexity in the worst case. However, fixing a sufficiently high value
of the credibility threshold may reduce the search space.

Continuation of Example 3: Consider the information table of Example 3.
We choose min as the T-norm and max as the T-conorm. Let us consider creating
a decision rule based on the description of the object $x_1$, i.e. $\rho_1 : (c_1 = 89) \wedge (c_2 =
91) \wedge (c_3 = 95) \wedge (c_4 = 87) \rightarrow (d = \phi)$. Three objects ($x_1, x_2, x_3$) are similar to
its condition part with $R_C(x_1, \rho_1) = 1$, $R_C(x_2, \rho_1) = 0.5$ and $R_C(x_3, \rho_1) = 0.1$. So
$S(\rho_1) = \{x_1, x_2, x_3\}$. We can compute the credibility of the rule according to the for-
mula: $\mu(\rho_1) = \min_{x \in S(\rho_1)}(\max(1 - R_C(x, \rho_1), \mu_{\Phi_C}(x)))$. So, taking values of lower
approximability we have $\mu(\rho_1) = \min(\max(1 - 1, 0.9), \max(1 - 0.5, 0.6), \max(1 -
0.1, 0)) = 0.6$. However, this is a rule with a redundant set of conditions. As one
can check, it can be reduced to the much simpler form $\rho_1 : (c_4 = 87) \rightarrow (d = \phi)$
which is still supported by objects $S(\rho_1) = \{x_1, x_2, x_3\}$ with credibility degree
$\mu(\rho_1) = 0.6$. Proceeding in a similar way we can induce other decision rules.
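
The arithmetic of this example can be checked with a minimal Python sketch, assuming min as T-norm and max as T-conorm as chosen in the text. The function name `rule_credibility` is ours; the input pairs are the similarities (1, 0.5, 0.1) and lower approximabilities (0.9, 0.6, 0) quoted above for $x_1, x_2, x_3$.

```python
from typing import Iterable, Tuple


def rule_credibility(support: Iterable[Tuple[float, float]]) -> float:
    """min over supporting objects of max(1 - similarity, lower approximability)."""
    return min(max(1.0 - similarity, lower) for similarity, lower in support)


# (similarity to the condition part, lower approximability) for x1, x2, x3,
# taken from the continuation of Example 3.
SUPPORT_RHO_1 = [(1.0, 0.9), (0.5, 0.6), (0.1, 0.0)]
CREDIBILITY = rule_credibility(SUPPORT_RHO_1)
print(CREDIBILITY)
```

The three inner maxima are 0.9, 0.6 and 0.9, so the weakest supporting object ($x_2$) determines the credibility of the rule.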

Let us now consider the use of induced decision rules to classify new un- 
classified objects. The problem is to assign such objects to a-priori known sets 
(decision classes) on the basis of their tolerance to the conditional part of the 
already induced rules. We have a double source of uncertainty. First, the new
object will be similar to a certain extent to the conditional part of a given rule.
Second, the rule itself has a credibility (classification is not completely sure any
more). In general a new unclassified object will be more or less similar to more
than one decision rule and such rules may indicate different decision classes. 
Therefore an unclassified object can be assigned to several different classes. In 
order to choose one class the following procedure is proposed: 

1. For each decision rule $\rho_i$ in the set of induced rules we calculate the tolerance
of the new object $z$ to its condition part, $R_B(z, \rho_i)$.

2. Then we compute the membership degree of object $z$ to each decision class
$\Phi_i$ as $\mu_{\Phi_i}(z) = T(R_B(z, \rho_i), \mu(\rho_i))$. Then we choose the class with the maximum
membership degree.

3. If a tie occurs (the same membership for different classes) choose the rule with
the highest number of supporting objects $|S(\rho_i)|$.
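
The three steps above can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes min as the T-norm, takes the similarities $R_B(z, \rho_i)$ as already computed, and the `Rule` record with its field names is hypothetical. Ties that remain after comparing support sizes are broken arbitrarily (here by class label).

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    decision_class: str   # class indicated by the rule
    credibility: float    # mu(rho_i)
    n_support: int        # |S(rho_i)|


def classify(similarities: List[float], rules: List[Rule]) -> Tuple[str, float]:
    """Return (class, membership degree) for a new object z.

    similarities[i] is R_B(z, rho_i) for rules[i].
    """
    scored = [(min(sim, rule.credibility), rule.n_support, rule.decision_class)
              for sim, rule in zip(similarities, rules)]
    # Highest membership first; ties go to the rule with more supporting objects.
    degree, _, label = max(scored)
    return label, degree


# Hypothetical rules and similarities, for illustration only.
RULES = [Rule("Phi", 0.6, 3), Rule("not Phi", 0.8, 2)]
print(classify([0.7, 0.5], RULES))
```

Here the first rule yields membership min(0.7, 0.6) = 0.6 and the second min(0.5, 0.8) = 0.5, so the object is assigned to the class of the first rule.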

5 Conclusion 

In the paper we develop the idea that valued tolerance relations (symmetric 
valued similarity relations) can be more suitable when objects are compared 
for classification purposes. Particularly when rough sets are used, the classic 
indiscernibility relation (which is a crisp equivalence relation) can be a too strong 
assumption with respect to the available information.

The main contribution of the paper consists in considering that any subset of
the universe of discourse can be considered as a lower (upper) approximation of
a set $\Phi$, but to a different degree, due to the existence of a valued tolerance relation
among the elements of the universe. A number of formal properties of this ap-
proach are demonstrated and discussed in the paper. Further on, the availability
of a lower (upper) approximability degree for each set with respect to a decision
class $\Phi$ makes it possible to compute classification rules equipped with a credibility de-
gree. A significant result obtained in the paper consists in demonstrating that a
rule's credibility is upper bounded by the lower approximability of the set whose
elements' description coincides with the conditional part of the rule. Finally, two
sources of uncertainty during classification of new objects were discussed.



References 

1. Dubois D., Prade H.: Rough Fuzzy Sets and Fuzzy Rough Sets. International Jour- 
nal of General Systems, 17 , (1990), 191-209. 

2. Greco S., Matarazzo B., Slowinski R.: Fuzzy similarity relation as a basis for rough
approximations. In Polkowski L., Skowron A. (eds.), Proc. of the RSCTC-98,
Springer Verlag, Berlin, LNAI 1424, (1998), 283-289.

3. Greco S., Matarazzo B., Slowinski R.: Rough set processing of vague information
using fuzzy similarity relations. In Calude C.S., Paun G. (eds.), Finite vs infinite:
contributions to an eternal dilemma, Springer Verlag, Berlin, (2000), 149-173.

4. Grzymala-Busse J.W.: On the unknown attribute values in learning from examples. 
Proc. of Int. Symp. on Methodologies for Intelligent Systems, (1991), 368-377. 

5. Kitainik L.: Fuzzy Decision Procedures with Binary Relations, Kluwer Academic, 
Dordrecht, (1993). 

6. Kryszkiewicz M.: Properties of incomplete information systems in the framework 
of rough sets. In Polkowski L., Skowron A. (eds.). Rough Sets in Data Mining and 
Knowledge Discovery, Physica- Verlag, Heidelberg, (1998), 422-450. 

7. Luce R.D.: Semiorders and a theory of utility discrimination, Econometrica, 24 , 
(1956), 178-191. 

8. Skowron A., Stepaniuk J.: Tolerance approximation spaces, Fundamenta Informat- 
icae, 27 , (1996), 245-253. 

9. Slowinski R., Vanderpooten D.: Similarity relation as a basis for rough approxi-
mations. In Wang P. (ed.), Advances in Machine Intelligence and Soft Computing,
vol. IV, Duke University Press, (1997), 17-33.

10. Stefanowski J., Tsoukias A.: On the extension of rough sets under incomplete
information. In Zhong N., Skowron A., Ohsuga S. (eds.), New Directions in Rough
Sets, Data Mining and Granular-Soft Computing, Springer Verlag, LNAI 1711,
Berlin, (1999), 73-81.

11. Tsoukias A., Vincke Ph.: A characterization of PQI interval orders, to appear in
Discrete Applied Mathematics, (2000).

12. Yao Y.: Combination of rough sets and fuzzy sets based on α-level sets. In Lin
T.Y., Cercone N. (eds.), Rough sets and data mining, Kluwer Academic, Dordrecht,
(1996), 301-321.

13. Yao Y., Wang T.: On rough relations: an alternative formulation. In Zhong N.,
Skowron A., Ohsuga S. (eds.), New Directions in Rough Sets, Data Mining and
Granular-Soft Computing, Springer Verlag, LNAI 1711, Berlin, (1999), 82-90.




A Conceptual View of Knowledge Bases 
in Rough Set Theory 



Karl Erich Wolff 

University of Applied Sciences Darmstadt, Department of Mathematics
Schöfferstr. 3, D-64295 Darmstadt, Germany
Ernst Schröder Center for Conceptual Knowledge Processing
Research Group Concept Analysis at Darmstadt University of Technology
E-mail: wolff@mathematik.tu-darmstadt.de



Abstract. Basic relationships between Rough Set Theory (RST) and Formal
Concept Analysis (FCA) are discussed. Differences between the "partition ori-
ented" RST and the "order oriented" FCA concerning the possibility of knowl-
edge representation are investigated. The fundamental connection between RST
and FCA is that the knowledge bases of RST and the scaled many-valued con-
texts of FCA are shown to be nearly equivalent.



1 Introduction: Rough Sets and Formal Concepts 

The purpose of this paper is to discuss some basic relationships between Rough Set 
Theory (RST) and Formal Concept Analysis (FCA). This discussion is a part of the 
more general investigation of the purposes, the underlying philosophical ideas, the 
formalizations, and the technical tools of knowledge theories as, for example, data 
analysis, data base theory, evidence theory, formal languages, automata theory, sys- 
tem theory, and Fuzzy Theory. Their relations to classical knowledge theories like 
geometry, algebra, logic, statistics, probability theory and physics also should be 
studied. 

Both RST and FCA were introduced, independently, in 1982, by Z. Pawlak [7] and R.
Wille [10], respectively. Roughly speaking, both theories formalize in some meaning-
ful way the concept of "concept"; a more careful explanation is given in section 2. Both
theories are used to investigate parts of "reality" described by some "measurement
protocol" from which "data" are obtained, technically represented mainly in data
tables. Both theories use data tables as the central tool for the development of deci-
sion aids. Both theories have been widely applied in science and also in industry. But
until now only a few personal contacts between the RST and FCA communities have
occurred. This paper should open up detailed discussion.

Despite their many similarities there are several differences between the two commu- 
nities. I believe that the underlying philosophical ideas are quite different. That seems 
to influence strongly the research aims and the strategies in research development. 
That should be discussed through personal interaction where the differences between
the formalizations of basic concepts in the theories can be discussed more clearly and
easily. 



2 Formal Concept Analysis 

Formal Concept Analysis (FCA) was introduced by Wille [10] and developed in the 
research group Concept Analysis of the mathematical department at Darmstadt Uni- 
versity of Technology (Germany). It is now the mathematical basis for research in 
Conceptual Knowledge Processing. For the mathematical foundations the interested 
reader is referred to Ganter, Wille [4,5]. Two elementary introductions were written 
by Wolff [13] and Wille [12]. From the latter we quote: 

Formal Concept Analysis is based on the philosophical understanding that a concept 
is constituted by two parts: its extension which consists of all objects belonging to the 
concept, and its intension which comprises all attributes shared by those objects. For 
formalizing this understanding it is necessary to specify the objects and attributes 
which shall be considered in fulfilling a certain task. Therefore, Formal Concept
Analysis starts with the definition of a formal context .... 



2.1 Formal Contexts and Concept Lattices 

The following definition of a formal context was motivated by the observation that 
the specific meaning of concepts in human thinking and communication is always 
determined by contexts. For the description of contexts we use the most simple verbal 
utterance which states that an object has an attribute. Therefore a formal context is 
defined as a triple (G,M,I) of sets where I is a binary relation between G and M, i.e.,
$I \subseteq G \times M$. The set G is called the set of objects (Gegenstände), the set M is called the
set of attributes (Merkmale) and the statement that the pair $(g,m) \in I$ is read "object
g has attribute m".

Why is this simple definition important for a formal theory on concepts? It is impor-
tant since it is possible to define for each formal context K a very meaningful concep-
tual hierarchy $(B(K), \le)$, whose elements, the formal concepts of K, represent units of
thought consisting of two parts, the extension and the intension, just as it is under-
stood in philosophical investigations dating back to Arnauld, Nicole ([1], 1685).

A formal concept of K = (G,M,I) is defined as a pair (A,B) where $A \subseteq G$, $B \subseteq M$ and
$A^{\uparrow} = B$ and $B^{\downarrow} = A$, where $A^{\uparrow}$ is the set of common attributes of A, formally de-
scribed as $A^{\uparrow} := \{m \in M \mid \forall g \in A:\ g\,I\,m\}$, and $B^{\downarrow}$ is the set of common objects of B,
$B^{\downarrow} := \{g \in G \mid \forall m \in B:\ g\,I\,m\}$. A is called the extent and B the intent of (A,B).

The set of all formal concepts of K is denoted by B(K). The conceptual hierarchy
among concepts is defined by set inclusion: For $(A_1, B_1), (A_2, B_2) \in B(K)$ let
$(A_1, B_1) \le (A_2, B_2) :\Leftrightarrow A_1 \subseteq A_2$ (which is equivalent to $B_2 \subseteq B_1$).
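
The definitions above can be made concrete with a small Python sketch. It is a naive illustration, not an algorithm from the paper: it closes every subset of objects, which is exponential in |G| and only suitable for tiny contexts. The function names and the three-object context are hypothetical.

```python
from itertools import combinations


def common_attributes(objs, attributes, incidence):
    """A-up: the attributes shared by all objects in objs."""
    return frozenset(m for m in attributes
                     if all((g, m) in incidence for g in objs))


def common_objects(attrs, objects, incidence):
    """B-down: the objects having all attributes in attrs."""
    return frozenset(g for g in objects
                     if all((g, m) in incidence for m in attrs))


def formal_concepts(objects, attributes, incidence):
    """All pairs (extent, intent) with extent-up = intent and intent-down = extent."""
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(sorted(objects), size):
            intent = common_attributes(subset, attributes, incidence)
            extent = common_objects(intent, objects, incidence)
            concepts.add((extent, intent))
    return concepts


# Hypothetical toy context.
OBJECTS = {1, 2, 3}
ATTRIBUTES = {"a", "b"}
INCIDENCE = {(1, "a"), (2, "a"), (2, "b"), (3, "b")}
CONCEPTS = formal_concepts(OBJECTS, ATTRIBUTES, INCIDENCE)

for extent, intent in sorted(CONCEPTS, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

For this context the sketch yields four concepts, and comparing extents by inclusion gives the same order as comparing intents by reverse inclusion, as stated in the definition of the conceptual hierarchy.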

An important role is played by the object concepts $\gamma(g) := (\{g\}^{\uparrow\downarrow}, \{g\}^{\uparrow})$ for $g \in G$
and dually the attribute concepts $\mu(m) := (\{m\}^{\downarrow}, \{m\}^{\downarrow\uparrow})$ for $m \in M$.

The pair $(B(K), \le)$ is an ordered set, i.e., $\le$ is reflexive, antisymmetric, and transitive
on B(K). It has some important properties:

$(B(K), \le)$ is a complete lattice, called the concept lattice of K, and any complete lattice
is isomorphic to a concept lattice; $(B(K), \le)$ contains the entire information in K, i.e.,
K can be reconstructed from B(K). If B(K) is finite it can be drawn as a line diagram
in the plane, such that K can be reconstructed.

Line diagrams of concept lattices can be drawn automatically by computer programs 
(Wille [11]) and serve as an important communication tool for the representation of 
multidimensional data (Wolff [14]). 

It is clear that binary relations and therefore formal contexts are used in nearly all 
branches of mathematics and in many applications; therefore Formal Concept Analy- 
sis is very useful in many situations, even if the formal contexts are not finite. One of 
the most famous infinite examples is the context $(Q, Q, \le_Q)$ of the rational numbers Q
with the usual rational ordering $\le_Q$. The concept lattice $B(Q, Q, \le_Q)$ is isomorphic to
the complete lattice of all real numbers including $\infty$ and $-\infty$ with the usual ordering on
this set. This conceptual construction of the real numbers shows that Formal Concept
Analysis covers not only finite structures.

Since each complete lattice is isomorphic to a concept lattice, and complete lattices,
closure systems, and closure operators are mathematically equivalent, Formal Con-
cept Analysis enriches the application of these theories by a strong communicational
component, which stems from the contextual meaning of the objects and attributes 
and the rich possibilities for visualizing multidimensional data by line diagrams of 
concept lattices. 



2.2 Conceptual Scaling 

The word ’scaling’ is understood here in the sense of ’embedding something in a cer- 
tain (usually well-known) structure’, called a scale: for example, embedding some 
objects according to the values of measurements of their temperature into a tempera- 
ture scale. Another example is the embedding of conference talks in the time sched- 
ule, which is usually a direct product of two time chains, one for hours and one for 
days. More generally, in conceptual scaling objects or values are embedded in the 
concept lattice of some formal context, called a conceptual scale. 

Conceptual Scaling Theory was developed by Ganter and Wille [3]. The general 
process in conceptual scaling starts with the representation of knowledge in a data 
table with arbitrary values and possibly missing values. These data tables are formally 
described by many-valued contexts (G,M,W,I), where G is a set of 'objects', M is a set
of 'many-valued attributes', W is a set of 'values' and I is a ternary relation, $I \subseteq
G \times M \times W$, such that for any $g \in G$, $m \in M$ there is at most one value w satisfying
$(g,m,w) \in I$. Therefore, a many-valued attribute m can be understood as a (partial)
function and we write m(g) = w iff $(g,m,w) \in I$. A many-valued attribute m is called
complete if it is a function. (G,M,W,I) is called complete if each $m \in M$ is complete.

The central granularity-choosing process in conceptual scaling theory is the
construction of a formal context $S_m = (W_m, M_m, I_m)$ for each $m \in M$ such that
$W_m \supseteq m(G) := \{m(g) \mid g \in G\}$. Such formal contexts, called conceptual scales, represent a
contextual language about the set of values of m. Usually one chooses $W_m$ as the set
of all 'possible' values of m with respect to some purpose. Each attribute $n \in M_m$ is
called a scale attribute. The set $n^{\downarrow} = \{w \in W_m \mid w\,I_m\,n\}$ is the extent of the attribute concept
of n in the scale $S_m$. Hence, the choice of a scale induces a selection of subsets of
$W_m$, describing the granularity of the contextual language about the possible values. The
set of all intersections of these subsets constitutes just the closure system of all extents
of the concept lattice of $S_m$.

The granularity of the language about the possible values of m induces in a natural
way a granularity on the set G of objects of the given many-valued context, since
each object g is mapped via m onto its value m(g) and m(g) is mapped via the object
concept mapping $\gamma_m$ of $S_m$ onto $\gamma_m(m(g))$: $g \mapsto m(g) \mapsto \gamma_m(m(g))$.

Hence the set of all object concepts of $S_m$ plays the role of a frame within which each
object of G can be embedded.

For two attributes $m, m' \in M$ each object g is mapped onto the corresponding pair:
$g \mapsto (m(g), m'(g)) \mapsto (\gamma_m(m(g)), \gamma_{m'}(m'(g))) \in B(S_m) \times B(S_{m'})$.

The standard scaling procedure, called plain scaling, constructs from a scaled many-
valued context $((G,M,W,I), (S_m \mid m \in M))$, consisting of a many-valued context
(G,M,W,I) and a scale family $(S_m \mid m \in M)$, the derived context, denoted by
$K := (G, \{(m,n) \mid m \in M, n \in M_m\}, J)$, where

$g\,J\,(m,n)$ iff $m(g)\,I_m\,n$ ($g \in G$, $m \in M$, $n \in M_m$).
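
Plain scaling as defined above can be sketched in Python as follows. The data layout is an assumption made for illustration: the many-valued context is a dictionary from (object, attribute) to value, which naturally allows partial attributes, and each scale is given as its list of scale attributes together with its incidence relation. The temperature data and scale are hypothetical.

```python
def derived_context(many_valued, scales):
    """Incidence J of the derived context: g J (m, n) iff m(g) I_m n.

    many_valued: dict mapping (g, m) to the value m(g); missing keys model
                 partial (incomplete) many-valued attributes.
    scales:      dict mapping m to (scale_attributes, scale_incidence), where
                 scale_incidence is a set of (value, scale_attribute) pairs.
    """
    incidence = set()
    for (g, m), w in many_valued.items():
        scale_attributes, scale_incidence = scales[m]
        for n in scale_attributes:
            if (w, n) in scale_incidence:
                incidence.add((g, (m, n)))
    return incidence


# Hypothetical example: one many-valued attribute with a two-attribute scale.
MANY_VALUED = {("g1", "temp"): 15, ("g2", "temp"): 30}
SCALES = {
    "temp": (["<=20", ">=10"],
             {(15, "<=20"), (15, ">=10"), (30, ">=10")}),
}
J = derived_context(MANY_VALUED, SCALES)
print(sorted(J))
```

The attributes of the derived context are the pairs (m, n), so the result can be passed directly to any routine that works on ordinary (one-valued) formal contexts.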

The concept lattice B(K) can be (supremum-)embedded into the direct product of the 
concept lattices of the scales (Wille [10], Ganter, Wille [4,5]). That leads to a very 
useful visualization of multidimensional data in so-called nested line diagrams, which 
is implemented in the program TOSCANA (Vogt, Wille [9]). 

Scaled many-valued contexts are essentially the same as information channels in the 
sense of (Barwise, Seligman [2]), which was shown by the author (Wolff [18]). 

Finally we mention that Fuzzy Theory, introduced by Zadeh [20], also developed 
some notion of a scale, namely the linguistic variables (Zadeh [21]). It was shown by 
Wolff ([15,16]) that Fuzzy Theory can be extended (by replacing the unit interval in
the definition of the membership function by an arbitrary ordered set $(L, \le)$) to
so-called L-Fuzzy Theory, which allows for developing, analogously as with Formal
Concept Analysis, a Fuzzy Scaling Theory which is equivalent to Conceptual Scaling 
Theory. 



3 Looking at Rough Sets Conceptually 

In this paper I do not repeat the basic notions in Rough Set Theory systematically. I
just describe some important differences between RST and FCA.



3.1 Partitioning or Ordering the Universe?

Pawlak ([8], p.2) starts his definition of a knowledge base with "a finite set $U \ne \emptyset$
(the universe) of objects we are interested in." The set U plays the same role in RST
as the set G of objects (which may be infinite or empty) in FCA. Furthermore, Pawlak
writes "Any subset $X \subseteq U$ of the universe will be called a concept or a category in
U...". This notion of "concept" corresponds extensionally to the notion of "extent of
an attribute concept" and intensionally to "attribute" in FCA, which is also indicated
by the name "category". Therefore the classical distinction between extents and in-
tents, which goes back to the "Port Royal Logic" of Arnauld and Nicole ([1], 1685), is
not represented in the "extensionally oriented" descriptions in RST.

It is remarkable that Pawlak does not introduce the notion of an arbitrary system of 
"overlapping" subsets of U, defined as a pair (U, S), where S is a subset of the power 
set P(U). The reason is that Pawlak ([8], p.3) "is mainly interested in this book with 
concepts which form a partition (classification) of a certain universe U...". 

This decision, to use partitions of the universe, is an expression of a certain (often 
successful) mode of thinking in nominal structures. Therefore the set inclusion does 
not play the same prominent role in RST as in FCA, where the conceptual hierarchy 
is defined by the inclusion of the extents (or, equivalently, the inverse inclusion of the 
intents) of formal concepts. The background for this "ordinal thinking" in FCA is the 
successful application of ordinal structures, used for example in the conceptual think- 
ing of Aristotle, in the subspace structures in spatial geometry and in lattices in logics. 
At first glance, the nominal approach in RST and the ordinal approach in FCA seem 
to be very different and incomparable. But each partition, for example the partition 
{M,F} of the universe of all people with the class M of male and the class F of female 
people, can be described by a formal context, for example $(M \cup F, \{M,F\}, \in)$. The
formal context of a partition is a nominal scale, defined as a formal context (G,M,I) 
where the relation I is a function from G to M. The concept lattice of a nominal scale 
is isomorphic to an antichain together with a top and a bottom element. 

On the other side, from any formal context K = (G,M,I) one can obtain a partition of
the object set G by taking the inverse images of the object-concept mapping $\gamma : G \to
B(K)$. This partition $p(\gamma) := \{\gamma^{-1}\gamma(g) \mid g \in G\}$ is called the partition of $\gamma$. The classes
of $p(\gamma)$ are just the equivalence classes of the relation $\sim$, where $g \sim h$ is defined by $g^{\uparrow}
= h^{\uparrow}$ (for $g, h \in G$). That means that g and h are indiscernible in the sense that g and
h have exactly the same attributes in K. The classes of $p(\gamma)$ are called the contingents
of $\gamma$ (or the object contingents of K).
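
Since two objects have the same object concept exactly when they have the same attributes, the contingents can be computed by grouping objects by their intents. The following Python sketch illustrates this; the function name and the four-object context are hypothetical.

```python
from collections import defaultdict


def object_contingents(objects, attributes, incidence):
    """Partition of the objects into classes with identical attribute sets."""
    classes = defaultdict(set)
    for g in objects:
        intent = frozenset(m for m in attributes if (g, m) in incidence)
        classes[intent].add(g)
    return sorted(sorted(block) for block in classes.values())


# Hypothetical toy context: objects 1 and 2 are indiscernible.
OBJECTS = [1, 2, 3, 4]
ATTRIBUTES = ["a", "b"]
INCIDENCE = {(1, "a"), (2, "a"), (3, "b"), (4, "a"), (4, "b")}
CONTINGENTS = object_contingents(OBJECTS, ATTRIBUTES, INCIDENCE)
print(CONTINGENTS)
```

Objects 1 and 2 fall into one contingent because both have exactly the attribute "a"; objects 3 and 4 each form a contingent of their own.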



3.2 Knowledge Bases and Scaled Many-Valued Contexts 

Pawlak ([8], p.3) defines a knowledge base as a pair $(U, \mathbf{R})$ where $\mathbf{R}$ is a family of
equivalence relations on the finite, non-empty set U, the universe of the knowledge
base. The FCA conceptual counterpart of a knowledge base $(U, \mathbf{R})$ is a scaled many-
valued context $((G,M,W,I), (S_m \mid m \in M))$. In the following we show how to construct
a suitable scaled many-valued context from a knowledge base. The main idea is to
take the equivalence relations $R \in \mathbf{R}$ as many-valued attributes and the equivalence
class $[x]_R$ as the value of R at the object $x \in U$. Then nominal scaling of the equiva-
lence classes yields a derived context which has as contingents just the indiscernibil-
ity classes of $(U, \mathbf{R})$, where the indiscernibility relation $IND(\mathbf{R})$ is the intersection of
all equivalence relations in $\mathbf{R}$. The formal description is given in Theorem 1.

Theorem 1: 

Let $(U, \mathbf{R})$ be a knowledge base. Then the scaled many-valued context $sc(U, \mathbf{R}) :=
((U, \mathbf{R}, W, I), (S_R \mid R \in \mathbf{R}))$ is defined by: $W := \{[x]_R \mid x \in U, R \in \mathbf{R}\}$ and $(x,R,w) \in I
:\Leftrightarrow w = [x]_R$ and the nominal scale $S_R := (U/R, U/R, =)$ for each many-valued attrib-
ute $R \in \mathbf{R}$. Then the indiscernibility classes of $(U, \mathbf{R})$ are exactly the contingents of
the derived context K of $sc(U, \mathbf{R})$.

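Proof: The derived context of $sc(U, \mathbf{R})$ is by definition $K = (U, \{(R, [x]_R) \mid R \in \mathbf{R},
[x]_R \in U/R\}, J)$, where the relation J is defined by $x\,J\,(R, [y]_R) :\Leftrightarrow [x]_R = [y]_R$ (for
$x, y \in U$ and $R \in \mathbf{R}$). Let $\gamma$ denote the object-concept mapping of K. Then we have to
prove that for all $x, y \in U$: $\gamma(x) = \gamma(y) \Leftrightarrow (\forall R \in \mathbf{R}:\ (x,y) \in R)$. Let $x, y \in U$, then
$(\forall R \in \mathbf{R}:\ (x,y) \in R) \Leftrightarrow (\forall R \in \mathbf{R}:\ [x]_R = [y]_R) \Leftrightarrow (\forall R \in \mathbf{R}:\ x\,J\,(R, [y]_R)) \Leftrightarrow \gamma(x) = \gamma(y)$.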

The construction of a knowledge base from a scaled many-valued context is de-
scribed in Theorem 2.

Theorem 2: 

Let $SC := ((G,M,W,I), (S_m \mid m \in M))$ be a scaled many-valued context, and $K := (G,
\{(m,n) \mid m \in M, n \in M_m\}, J)$ its derived context. Then the knowledge base kb(SC) is
defined by $kb(SC) := (G, \mathbf{R})$, where $\mathbf{R} := \{R_m \mid m \in M\}$ and for $m \in M$: $R_m := \{(g,h) \in
G \times G \mid \gamma_m(g) = \gamma_m(h)\}$ and $\gamma_m$ is the object-concept mapping of the m-part of K;
clearly, the m-part of K is the formal context $(G, \{(m,n) \mid n \in M_m\}, J_m)$ where
$J_m := \{(g, (m,n)) \in J \mid n \in M_m\}$. Then the indiscernibility classes of kb(SC) are exactly the
contingents of the derived context K of SC.

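Proof: Let $g, h \in G$, then the object-concept mapping $\gamma$ of K satisfies: $\gamma(g) = \gamma(h)
\Leftrightarrow \forall m \in M:\ \gamma_m(g) = \gamma_m(h) \Leftrightarrow \forall m \in M:\ (g,h) \in R_m \Leftrightarrow (g,h) \in IND(\mathbf{R})$.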

If we combine the two operations sc and kb we conclude that the nominal scaling in 
the definition of sc is "compatible" with the operation kb in the sense of the following 
theorem: 



Theorem 3: 

For any knowledge base (U, R): kb(sc(U, R)) = (U, R). 



Proof: Theorem 3 follows from Theorem 1 and Theorem 2.

Clearly, for a scaled many-valued context $SC := ((G,M,W,I), (S_m \mid m \in M))$ we can
not expect that sc(kb(SC)) equals SC. The reason is that the ordinal structures of the
concept lattices of the m-parts of the derived context K of SC are not represented in
the knowledge base kb(SC).
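
The statements of Theorems 1 and 3 can be checked on a small example with the following Python sketch. It rests on several assumptions made for illustration: each equivalence relation is represented by its partition of U, the derived context of sc(U, R) is built directly from the nominal scaling rule "x J (R, [x]_R)", and all names and the five-object knowledge base are hypothetical.

```python
def eq_class(partition, x):
    """The block [x]_R of the partition that contains x."""
    return next(frozenset(block) for block in partition if x in block)


def sc_derived_context(universe, relations):
    """Incidence J of the derived context of sc(U, R) under nominal scaling."""
    return {(x, (name, eq_class(partition, x)))
            for x in universe for name, partition in relations.items()}


def group_by_intent(universe, incidence, keep=lambda attr: True):
    """Partition the universe by the (optionally restricted) object intents."""
    classes = {}
    for x in universe:
        intent = frozenset(a for (y, a) in incidence if y == x and keep(a))
        classes.setdefault(intent, set()).add(x)
    return {frozenset(block) for block in classes.values()}


def kb(universe, incidence, names):
    """kb: one partition per many-valued attribute, taken from its m-part."""
    return {name: group_by_intent(universe, incidence,
                                  lambda a, n=name: a[0] == n)
            for name in names}


def ind_classes(universe, relations):
    """Classes of IND(R): objects equivalent under every relation."""
    classes = {}
    for x in universe:
        key = tuple(eq_class(relations[name], x) for name in sorted(relations))
        classes.setdefault(key, set()).add(x)
    return {frozenset(block) for block in classes.values()}


# Hypothetical knowledge base with two equivalence relations on five objects.
U = {1, 2, 3, 4, 5}
RELATIONS = {"colour": [{1, 2, 5}, {3, 4}], "shape": [{1, 3}, {2, 4, 5}]}

J = sc_derived_context(U, RELATIONS)
CONTINGENTS = group_by_intent(U, J)
print(CONTINGENTS == ind_classes(U, RELATIONS))          # Theorem 1
print(kb(U, J, RELATIONS) ==
      {n: {frozenset(b) for b in p} for n, p in RELATIONS.items()})  # Theorem 3
```

The converse direction is not expected to round-trip, as noted above: `kb` keeps only the partitions and forgets any order structure carried by non-nominal scales.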

This investigation uncovers another aspect of RST, which is in a way not evident 
from the viewpoint of theory construction: Pawlak ([8], p.5) introduces basic catego- 
ries as "set theoretical intersections of elementary categories", where elementary 
categories are the classes of the equivalence relations in (U, R). The set of all basic 
categories of (U, R) is exactly the closure system of the extents of the derived context 
K of sc(U, R). That the basic category "red and triangular" describes a subconcept (in 
RST and in FCA) of the basic category "red" shows that the ordinal structure of the 
conceptual hierarchy is used, expressed with the lattice operation "infimum", which 
corresponds to the intersection of extents according to the Basic Theorem on Concept 
Lattices (Ganter, Wille [5], p.20). The same theorem states that the supremum in the 
concept lattice corresponds not necessarily to the union of the extents, but to the 
closure of the union of the extents. Therefore, the construction of all unions of basic 
categories leads to a usually much larger lattice than the corresponding concept 
lattice. 

The above-studied relationship between scaled many-valued contexts and knowledge
bases should be seen also in connection with information channels, which are
essentially the same as scaled many-valued contexts (Wolff [18]).



4 Conclusion and Future Perspectives 

This short comparison between some fundamental notions in RST and FCA can be 
extended further to discuss, for example, the role of granularity in RST, Fuzzy The-
ory, and FCA on the basis of "L-Fuzzy Scaling Theory" being equivalent to Concep-
tual Scaling Theory (Wolff [15,16]). Another interesting field is the reduction of
knowledge and the conceptual role of reducts. For practical purposes several kinds of 
dependencies should be carefully compared. Numerous applications where the set of 
variables of a data table is divided into two (or more) parts of independent and de- 
pendent variables, of input and output variables, of time and space variables can be 
described using the decomposition and embedding techniques in FCA. Examples are 
decision tables, switching circuits, automata and the representation of systems in 
Mathematical System Theory (Mesarovic, Takahara [6]). The role of "time" in the 
formal description of systems was conceptually investigated by the author (Wolff 
[17]); conceptual time systems, consisting of a many-valued context with an object 
set of "time granules", a "time part" and a "space part" lead to a conceptual system 
theory in which state and phase spaces are introduced as concept lattices. This ap- 
proach can be used for a conceptual description of arbitrary automata, including an 
interpretation of the edges of the directed graph of the automaton as implications of a 
suitable formal context. 
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Abstract. This paper presents a Laboratory implementation of a com- 
puter vision using fuzzy logic techniques for the detection of boundaries 
in images obtained through a camera installed in a robotic arm. Here, 
these images were captured, digitized and analysed using fuzzy algorithms, 
in order to be used for the control of the robotic arm carrying the 
camera. The image treatment process uses the method proposed by Pal 
and Majumder (1987), which employs image enhancement techniques 
through the use of fuzzy set concepts. The image recognition process 
generates the control signal necessary to move the robotic arm in a given 
specific pathway, as well as to determine the next action to be taken at 
the end of the task. 



1 Introduction 

The technology of artificial vision systems is becoming increasingly important. 
Its applications may be found in several areas, such as industry, medicine and 
robotics. In the case of robots, vision is indispensable. Vision systems 
for use with robots should be designed according to two basic criteria (Groover, 
Weiss, Nagel and Odrey, 1989): 

1st - Relatively low cost; 

2nd - Relatively fast response time. 

Usually there are two methods for the implementation of vision systems. One 
takes the image of the object and tries to reduce it to contour lines that form the 
profile of that object. This method utilises filters to obtain the image information 
and uses contrast enhancement to turn all parts of the image into either black 
or white (thresholding). The resulting image is usually called a binary image. 



W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 229-237, 2001. 
© Springer-Verlag Berlin Heidelberg 2001 




230 A. Breunig et al. 



The advantage of binary images is that they supply well defined limits, easily 
recognized by simple algorithms. This type of implementation is used by systems 
for the processing of two dimensional images. 

The other method for the implementation of vision systems tries to give 
the computer an image closer to what humans perceive. This method gives 
the computer information about the brightness of the image. This allows the 
computer to obtain two important image characteristics that cannot be 
obtained with the high-contrast technique: surfaces and shadows. These 
can be used to obtain three dimensional image information and to resolve conflicts 
when one object partially blocks another. This method is usually used for three 
dimensional vision systems. 

In this work we shall use the first method, as we will have a controlled working 
environment, utilizing images with sufficiently clear shapes. 



2 Vision System 

The image is obtained with a camera connected directly to the parallel port 
of a computer. The image is digitized with a resolution of 64 x 64 pixels and 
256 levels of grey. The image is analyzed through the techniques of image 
enhancement and histogram analysis, using fuzzy algorithms for this purpose. The 
image analysis generates a control signal for the movement of the six-joint 
robotic arm (Fig. 1). 




Fig. 1. General scheme of the robot system 



3 Pre-processing 

The analog video signal obtained with the camera is sampled and quantized 
in order to be suitable for computer processing. 
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3.1 Image Representation - Matrix Form 

A monochromatic image may be described as a mathematical function f(x,y) of 
the light intensity, whose value at a point (x, y) is proportional to the level 
of grey at that point. 

This function f(x,y) is represented by an M x N matrix of grey levels (Eq. 1), 
each element representing a pixel: 

$$f(x,y) = \begin{bmatrix} f(0,0) & f(0,1) & \cdots & f(0,N-1) \\ f(1,0) & f(1,1) & \cdots & f(1,N-1) \\ \vdots & \vdots & & \vdots \\ f(M-1,0) & f(M-1,1) & \cdots & f(M-1,N-1) \end{bmatrix} \qquad (1)$$

3.2 Image Representation - Fuzzy Set Theory 

An image F, where F = f(x, y) (Eq. 1), with dimensions M x N and L levels of 
grey may be considered as a matrix of fuzzy singletons, each one with a membership 
value that indicates the relative brightness with respect to a grey level l, where 
l = 0, 1, 2, ..., L - 1. Applying fuzzy set theory we may write (Pal and Majumder, 
1987): 

$$F = \{\mu_{mn}/x_{mn}\}, \quad m = 1, 2, \ldots, M;\; n = 1, 2, \ldots, N$$

or, in union form, 

$$F = \bigcup_{m}\bigcup_{n} \mu_{mn}/x_{mn} \qquad (2)$$

where $\mu_{mn}/x_{mn}$ ($0 \le \mu_{mn} \le 1$) represents the grade of possessing some property 
$\mu_{mn}$ by the (m,n)th pixel $x_{mn}$. This fuzzy property $\mu_{mn}$ may be defined in a 
number of ways with respect to any brightness level depending on the problem 
at hand. 



3.3 Digitizing 

The signal acquired through the camera must be sampled and quantized in order 
to be suitable for computer processing. An increase in the values of M and N in 
(2) means a better image resolution. Quantization rounds up the value of each 
pixel and places it in the range 0 to $2^n - 1$, where the larger the value of 
n, the greater the number of grey levels present in the digitized image. 

The digitized signal is the analog signal converted to digital form, in which the 
number of samples per unit of time gives the sampling rate, and the number of 
bits of the analog/digital converter determines the number of levels of grey. 

It should be decided which values of M, N and l are suitable, taking into 
account image quality and the number of bits necessary for storage. We may 
calculate the number of bytes used by means of the expression N x M x l/8 
(Marques and Vieira, 1999). It should be remembered that image quality is a 
subjective concept and is strongly dependent on the application itself. 
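
As a rough check of the storage expression, the sketch below evaluates N x M x l/8 for the 64 x 64 image with 256 grey levels described in Section 2. It assumes that l in the expression stands for the number of bits per pixel (8 bits for 256 levels); the function names are illustrative and are not part of the original work.

```python
def grey_levels(bits_per_pixel: int) -> int:
    """Number of grey levels available with the given bits (range 0 .. 2^n - 1)."""
    return 2 ** bits_per_pixel


def image_storage_bytes(rows: int, cols: int, bits_per_pixel: int) -> float:
    """Storage in bytes for a rows x cols image: N x M x l / 8."""
    return rows * cols * bits_per_pixel / 8


# 64 x 64 pixels, 8 bits per pixel (assumed reading of "256 levels of grey").
print(grey_levels(8))
print(image_storage_bytes(64, 64, 8))
```

Under these assumptions a single frame fits in a few kilobytes, which is in line with the low-cost criterion stated in the introduction.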

4 Image Treatment 

The captured image is made up of two parts: the object to be identified and the 
background. In order to separate these two parts it is necessary to have good 
contrast, that is, there must be a significant difference in light intensity between 
points of the object and points of the background. 

Image treatment is based on the principle of thresholding (Gonzalez and 
Woods, 1992), that is, assigning the value 0 to pixels above a given value X 
(determined by the histogram analysis), and the value 1 to pixels below or 
equal to X. The image enhancement algorithm used in this work was 
suggested by Pal and Majumder (1987) and is shown in Fig. 2. This technique 
modifies the pixels using the properties of fuzzy sets. The procedure involves a 
preliminary image enhancement in block “E”, followed by a smoothing in block 
“S” and then a further enhancement. The fuzzy operator INT (contrast 
intensification) is used in both enhancements. The purpose of the image smoothing is 
to blur the image before the second enhancement. 




Fig. 2. Block diagram of the enhancement model 



The algorithm used in the image enhancement (Fig. 2) is explained in the 
pages that follow. 



4.1 Histogram 

The histogram gives an idea of the image quality in terms of contrast and relative 
proportions of white and black in the image. 

The histogram of an image is usually a bar chart that relates the grey 
levels l to the number of pixels in the image. The vertical 
axis represents the number of pixels at a given grey level l. The horizontal 
axis represents the grey levels l. 

Here, the histogram is used to give an initial idea of the image quality for 
later comparison with the image obtained after blocks “E” and “S”. These are 
explained in sub-sections 4.2 and 4.3. 
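
The histogram and the thresholding rule of Section 4 can be sketched in a few lines. This is only an illustration: the tiny image and its four grey levels are invented, and the function names are hypothetical. The threshold follows the convention stated above (0 for pixels above X, 1 otherwise).

```python
def histogram(image, levels):
    """Count how many pixels take each grey level 0 .. levels - 1."""
    counts = [0] * levels
    for row in image:
        for pixel in row:
            counts[pixel] += 1
    return counts


def threshold(image, x):
    """Binarise the image: 0 for pixels above x, 1 for pixels below or equal to x."""
    return [[0 if pixel > x else 1 for pixel in row] for row in image]


# Hypothetical 3 x 4 image with 4 grey levels (values invented for illustration).
image = [
    [0, 0, 3, 3],
    [0, 1, 3, 2],
    [1, 0, 2, 3],
]
print(histogram(image, 4))
print(threshold(image, 1))
```

A histogram with two well separated groups of bars, as in this toy image, suggests that a threshold placed between them separates object and background.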



4.2 Image Enhancement through the Concepts of Fuzzy Sets 



Image enhancement is accomplished through the INT operator. The operation 
of contrast intensification (INT) upon a fuzzy set A creates another fuzzy set 
A' = INT(A), whose membership function is given by: 

$$\mu_{A'}(x) = \mu_{INT(A)}(x) = \begin{cases} 2[\mu_A(x)]^2, & 0 \le \mu_A(x) \le 0.5 \quad (3a) \\ 1 - 2[1 - \mu_A(x)]^2, & 0.5 \le \mu_A(x) \le 1 \quad (3b) \end{cases}$$

This operation reduces the fuzziness of a set A by increasing the values of 
$\mu_A(x)$ which are above 0.5 and decreasing those which are below it. Let us now 
define operation (3) by a transformation $T_1$ of the membership function $\mu(x)$ (Pal 
and King, 1981). In general, each $\mu_{mn}$ in F (Eq. 2) may be modified to $\mu'_{mn}$ to 
enhance the image F in the property domain by a transformation function $T_r$, 
where: 

$$\mu'_{mn} = T_r(\mu_{mn}) = \begin{cases} T'_r(\mu_{mn}), & 0 \le \mu_{mn} \le 0.5 \quad (4a) \\ T''_r(\mu_{mn}), & 0.5 \le \mu_{mn} \le 1 \quad (4b) \end{cases}$$

and r = 1, 2, .... 

The transformation function $T_r$ is defined as successive applications of $T_1$ by 
the recursive relationship: 

$$T_s(\mu_{mn}) = T_1\{T_{s-1}(\mu_{mn})\}, \quad s = 1, 2, \ldots \qquad (5)$$

and $T_1(\mu_{mn})$ represents the operator INT defined in Eq. 3. It can be shown that, as 
r increases, the curve tends to be steeper because of the successive application 
of INT. In the limiting case, as $r \to \infty$, $T_r$ produces a two-level (binary) image. 
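
The effect of repeated intensification can be seen with a minimal sketch of the INT operator and of its r-fold application. The code assumes the standard form of INT (squaring below 0.5 and its mirror image above 0.5); the function names are illustrative only.

```python
def intensify(mu: float) -> float:
    """Contrast intensification INT for one membership value in [0, 1]."""
    if mu <= 0.5:
        return 2.0 * mu ** 2                # branch (3a)
    return 1.0 - 2.0 * (1.0 - mu) ** 2      # branch (3b)


def transform(mu: float, r: int) -> float:
    """T_r: r successive applications of INT, following the recursion (5)."""
    for _ in range(r):
        mu = intensify(mu)
    return mu


# Values below 0.5 drift towards 0, values above 0.5 drift towards 1.
for mu in (0.3, 0.5, 0.7):
    print(mu, [round(transform(mu, r), 4) for r in (1, 2, 3)])
```

The value 0.5 is a fixed point of the operator, while every other membership value is pushed towards 0 or 1, which is why a large r yields an essentially binary image.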



4.3 Smoothing Algorithm - Averaging 

This method is based on averaging the intensities of neighbouring pixels and is usually 
used to remove “salt and pepper” noise. The smoothed (block “S”, Fig. 2) 
(m,n)th pixel intensity is: 

$$x'_{mn} = \frac{1}{|Q_1|} \sum_{(i,j) \in Q_1} x_{ij}, \quad (i,j) \ne (m,n) \qquad (6)$$

where $Q_1$ is the set of neighbours of the (m,n)th pixel. The larger the 
neighbourhood $Q_1$, the greater the degree of blurring. 

The smoothing algorithm described above blurs the image by attenuating 
the high spatial frequency components associated with edges and other abrupt 
changes in grey levels. 
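
A minimal sketch of the averaging step is given below. It assumes that the neighbourhood consists of the four directly adjacent pixels and that border pixels average only the neighbours that exist; both choices, like the function name, are assumptions made for illustration rather than details taken from the original implementation.

```python
def smooth(image):
    """Replace each pixel by the mean of its 4-connected neighbours.

    The pixel itself is excluded from the average. Border pixels use only
    the neighbours that lie inside the image (an assumed border rule).
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            neighbours = [
                image[i][j]
                for i, j in ((m - 1, n), (m + 1, n), (m, n - 1), (m, n + 1))
                if 0 <= i < rows and 0 <= j < cols
            ]
            out[m][n] = sum(neighbours) / len(neighbours)
    return out


# A single bright "salt" pixel in a dark 3 x 3 patch (invented values).
noisy = [
    [0, 0, 0],
    [0, 8, 0],
    [0, 0, 0],
]
for row in smooth(noisy):
    print(row)
```

The isolated bright pixel is removed and its intensity is spread over the adjacent pixels, which is the blurring that the second enhancement block then sharpens again.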
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5 Hardware System 

In this section, the hardware system, developed to implement the control strategy 
presented in this paper, is described. There are six DC motors to drive the joints 
and six potentiometers to measure the joint angles. 

The driving circuit includes high frequency transistors, switched by pulse 
width modulation (PWM). The end effector is actuated using a pneumatic transmitter. The 
three base joints are free to turn up to 270°, and the upper three joints can only 
rotate 180°. 

To test the strategies presented in this paper, a specific circuit had to be 
developed, since the manufacturer’s driving circuit does not allow for most of 
the requirements of the approach developed in this research project. Therefore, 
only the original mechanism, the motors, and the potentiometers are used in this 
work. All the electronic circuits and software programs have been developed and 
implemented. Three Microchip™ PIC microcontrollers are used (Microchip, 
1997). 

They were chosen because they are cheap and relatively easy to implement. 
They use 14-bit RISC technology. It is possible to deal with a large 
number of interrupt schemes and to generate PWM signals independently of 
the CPU synchronization command. 

Figure 1 presents a general scheme of the robot system used in this work. 
Basically, the system consists of two hierarchical levels. In the highest level there 
is a Pentium computer working as a host. In the second level, there are three 
micro controllers of the type PIC16C73A. 

6 The Fuzzy Controller 

The use of a fuzzy controller for trajectory tracking of such a robot arm is very 
appropriate, considering the nonlinear behavior of the system. There are some 
proposals for the application of fuzzy logic to robot control (Luh, 1983). The 
main reason for applying fuzzy control in this kind of application is, 
basically, the nonlinear characteristic of robots. There are many other approaches 
for controlling robot arms (Bonitz, 1996; Moudgal, Passino, & Yurkovich, 1995; 
Rocco, 1996). Some of them demand a lot of computing effort. In some cases, 
where a very fast response is required, the computational burden could be 
prohibitive. The use of fuzzy logic in such a case is very attractive since it requires 
a minimum of computing time. On the other hand, the power of fuzzy logic 
is such that the coupling effects of the joints can be completely compensated 
using the appropriate fuzzy rules. The approach presented in this paper differs 
to some extent from the methods found in the literature (Young & Shiah, 1997; 
Kawasaki, Bito, & Kanzaki, 1996; Cox, 1994) since there is a combination 
of individual fuzzy controllers for each joint and a master fuzzy controller used 
to generate the set points. Therefore, the couplings found in the general robot 
dynamic equation are completely compensated. There are also two fuzzy 
controllers for each joint: one for speed control and another for position control. 
They work in such a way that the speed controller acts in the first part of the 
trajectory. At the end of it, the position controller takes over, leading to a very 
smooth positioning. The decision for the switching of all those controllers is also 
made by the master fuzzy controller. A simplified scheme is presented 
in Fig. 3. In Fig. 3, $\theta$ and $\dot{\theta}$ are the desired position and speed of the joint, 
respectively. The position controller uses the error and the error variation as inputs. 
The inputs for the speed controller, on the other hand, are the speed error and 
its variation. Each fuzzy controller has the following characteristics: 

• Gaussian membership functions 

• Defuzzification using the center of area 

• The use of simple product as the minimum operation 

• Concentration operator for the output function. 

The universe of discourse of position for each one of the base joints is [-270° to 
270°] and for the upper joints is [-180° to 180°]. The universe of discourse of 
the output for each joint is [-192° to 192°], corresponding to the PWM input 
of the motor drives. The rule base was defined using cardinality equal to 7 for 
every controller, although smaller rule bases could also be tested with excellent 
results. The rules were implemented based on the experience acquired from the 
observation of the robot behavior. 
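
The listed characteristics can be combined into a small single-joint controller sketch. Everything specific in it is an assumption made for illustration: the inputs are taken as normalised to [-1, 1], the seven term centres and their width are evenly spaced guesses, and the rule table (output index = sum of the input indices, shifted and clamped) is hypothetical, since the original rule base is not given. Only the general ingredients come from the text: Gaussian membership functions, the product as the conjunction, a concentration operator (squaring) on the output sets, and center-of-area defuzzification over [-192, 192].

```python
import math

# Seven linguistic terms per variable (cardinality 7); centres are assumed.
CENTRES = [-1.0, -2 / 3, -1 / 3, 0.0, 1 / 3, 2 / 3, 1.0]
WIDTH = 1 / 3  # hypothetical spread of each Gaussian


def gauss(x, c, sigma):
    """Gaussian membership function centred at c."""
    return math.exp(-((x - c) / sigma) ** 2)


def controller(error, d_error, out_max=192.0, samples=193):
    """Map normalised error and error variation to a PWM command."""
    ys = [-out_max + 2 * out_max * k / (samples - 1) for k in range(samples)]
    aggregated = [0.0] * samples
    for i, ce in enumerate(CENTRES):
        for j, cd in enumerate(CENTRES):
            # Product used as the conjunction of the two antecedents.
            firing = gauss(error, ce, WIDTH) * gauss(d_error, cd, WIDTH)
            # Hypothetical rule table: output term index i + j - 3, clamped.
            out_centre = CENTRES[min(6, max(0, i + j - 3))] * out_max
            for k, y in enumerate(ys):
                # Concentration operator: square the output membership.
                mu_out = gauss(y, out_centre, WIDTH * out_max) ** 2
                aggregated[k] = max(aggregated[k], firing * mu_out)
    # Center-of-area defuzzification on the sampled output universe.
    return sum(y * a for y, a in zip(ys, aggregated)) / sum(aggregated)


print(round(controller(0.0, 0.0), 3))
print(round(controller(0.5, 0.0), 3))
```

With a symmetric rule table such as this one, zero error gives a command close to zero and the sign of the command follows the sign of the error; the actual sign convention and gains of the robot drives would have to be taken from the hardware.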




Fig. 3. Control system for joint 1 



7 Practical Results 

The results of the enhancement method (Fig. 2) are shown in Figures 5 to 10. 

The results shown in Figures 5 to 10 were obtained directly from the camera 
acquisition system. Fig. 5 and 8 are the original image and the histogram of 
the original image. In the next step the original image is enhanced, smoothed 
(Fig. 5(b)) and then enhanced a second time (Fig. 5(c)). Finally the binary image 
(Fig. 5(c) and 5(e)) is obtained and applied in the control system of the robot 
manipulator. 

Fig. 5. Original image 

Fig. 6. Image after first enhancement 

Fig. 7. Image after smoothing 

Fig. 8. Image after second enhancement 

The results shown in Figures 5(a) to 5(f) were obtained directly from the 
camera acquisition system. Figures 6 and 9 are the original image and the 
histogram of the original image. In the next step the original image is enhanced 
(Figure 6), smoothed (Figure 7) and then enhanced a second time (Figure 8). Finally the 
binary image (Figures 8 and 10) is obtained and applied in the control system of 
the robot manipulator. 

The results shown in Figures 11 to 16 were obtained directly from the robot 
acquisition system. In the test, each joint started from 50°. The desired final 
angle for each joint is shown as a continuous line in the figures. Some very small 
overshoots can be seen, but they could be compensated with a more careful adjustment 
of the rule base. The switching from the speed to the position controller is 
accomplished by the master fuzzy controller. Figure 17 explains how the switching 
is done. 



8 Conclusions 

The results obtained show that the method applied is effective in a controlled 
environment. The model of Fig. 3 shows good possibilities for improvement, since 
the method used in block “S” may be altered, for instance by means of a max-min 
filter, and in this way improve the results sent to the second block “E”. 

The study also shows that, although the averaging filter was used with 
only two iterations in each block “E”, we obtained a binary image. 

The results obtained for image enhancement using fuzzy logic show the method to be 
effective, very fast and simple to implement. 
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Abstract. The notion of lattice has a wide variety of applications in 
the areas of physical sciences, communication systems, and information 
analysis. Fuzzy set theory, with its proximity to a multitude of areas 
closely related to AI, provides an alternative to the traditional concept 
of set membership. Nanda[4] proposed the concept of fuzzy lattice, using 
the notion of fuzzy partial ordering. But after a critical study, it has been 
observed that his definition contains some redundancy. As a consequence 
of this observation, we present a modified definition of fuzzy lattice in 
this paper. 



1 Introduction 



Lattice structure has been found to be extremely important in the areas of 
quantum logic, ergodic theory, Reynold’s operators, communication systems and 
information analysis systems. Some system models often include excessive com- 
plexity of the situation which in turn may lead to consequences where it is 
difficult to formulate the model or the model is too complicated to be used in 
practice. This practical inconvenience is caused by our inability to differenti- 
ate events in real situations exactly and thus to define instrumental notions in 
precise form. 

The concept of human cognition and interaction with the outer world involves 
objects that are not sets in the classical sense, but fuzzy sets, which imply 
classes with unsharp boundaries in which the transition from membership to 
non-membership is gradual rather than abrupt. 

Nanda[4] proposed the notion of fuzzy lattice using the concept of fuzzy 
partial ordering. After a critical study, we observe that his definition 
contains some redundancy and is incomplete. 
As a consequence, we modify the definition of fuzzy lattice 
in the present paper. 
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2 Preliminaries 



Let $\Omega$ be any set and let R be a fuzzy relation defined over $\Omega$. Then, R is said 
to be 

- Max-min transitive if $R \circ R \subseteq R$ or, more explicitly, if $\forall (x, y, z) \in \Omega^3$, 

$$\mu_R(x,z) \ge \min\{\mu_R(x,y),\ \mu_R(y,z)\}$$

- Reflexive if $\forall x \in \Omega$, $\mu_R(x,x) = 1$ 

- Perfectly antisymmetric if 

$$\forall (x,y) \in \Omega^2,\ x \ne y,\quad \mu_R(x,y) > 0 \Rightarrow \mu_R(y,x) = 0$$

where $\mu_R(x,y)$ represents the membership value of the pair (x, y) in R. 

A fuzzy relation P defined over a set $\Omega$ is said to be a fuzzy partial ordering 
if and only if it is reflexive, max-min transitive and perfectly antisymmetric. A 
set $\Omega$ along with a fuzzy partial ordering P defined on it is called a fuzzy 
partially ordered set. 
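
The three defining properties can be checked mechanically on a finite set. The sketch below stores a fuzzy relation as a square matrix of membership values; the 3 x 3 relation P is an assumed example chosen for illustration, and all function names are hypothetical.

```python
def is_reflexive(R):
    """mu_R(x, x) = 1 for every x."""
    return all(R[i][i] == 1 for i in range(len(R)))


def is_max_min_transitive(R):
    """mu_R(x, z) >= min(mu_R(x, y), mu_R(y, z)) for all x, y, z."""
    n = len(R)
    return all(
        R[x][z] >= min(R[x][y], R[y][z])
        for x in range(n) for y in range(n) for z in range(n)
    )


def is_perfectly_antisymmetric(R):
    """For x != y, mu_R(x, y) > 0 implies mu_R(y, x) = 0."""
    n = len(R)
    return all(
        not (R[x][y] > 0 and R[y][x] > 0)
        for x in range(n) for y in range(n) if x != y
    )


def is_fuzzy_partial_order(R):
    return (is_reflexive(R)
            and is_max_min_transitive(R)
            and is_perfectly_antisymmetric(R))


# Assumed example relation on three elements.
P = [
    [1.0, 1.0, 0.8],
    [0.0, 1.0, 0.0],
    [0.0, 0.7, 1.0],
]
# A symmetric relation with positive off-diagonal values is not antisymmetric.
Q = [
    [1.0, 0.5],
    [0.5, 1.0],
]
print(is_fuzzy_partial_order(P), is_fuzzy_partial_order(Q))
```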

Let $\Omega$ be a fuzzy partially ordered set with a fuzzy partial order P defined 
over it. With each $x \in \Omega$, we associate two fuzzy sets: 

- The dominating class, $P_{\ge}(x)$, defined by $P_{\ge}(x)(y) = P(y,x)$. 

- The dominated class, $P_{\le}(x)$, defined by $P_{\le}(x)(y) = P(x,y)$. 

Let M be a non-fuzzy subset of $\Omega$. Then the fuzzy upper bound of M, 
denoted by $U_\phi(M)$, is a fuzzy set defined by 

$$U_\phi(M) = \bigcap_{x \in M} P_{\ge}(x).$$

The fuzzy lower bound of M, denoted by $L_\phi(M)$, is a fuzzy set defined by 

$$L_\phi(M) = \bigcup_{x \in M} P_{\le}(x).$$


3 Comment 



In his paper [4], Nanda defined the notion of fuzzy lattice in the following way. 
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3.1 Definition 

A fuzzy partially ordered set X (a non-fuzzy set X with a fuzzy partial order 
defined on it) is called a fuzzy lattice if every two-element non-fuzzy set (i.e. 
every pair of elements) in X has fuzzy lower bound and fuzzy upper bound. 



3.2 Observation 

After a critical study, we analysed the above definition as proposed by Nanda[4], 
and observed that in the case of any fuzzy partially ordered set X, every two-element 
non-fuzzy set in it has a fuzzy lower bound and a fuzzy upper bound, which are 
nothing but two fuzzy subsets of X. Hence we conclude that this definition is 
incomplete in the sense that, according to it, every fuzzy partially 
ordered set becomes a fuzzy lattice. Hence we modify the concept and redefine 
the notion of fuzzy lattice. 



4 Modified Definition 

The modified definition of fuzzy lattice is as follows: 



4.1 Definition 

Let X be a fuzzy partially ordered set and let A be a fuzzy subset of X. Then 
A is said to be a fuzzy lattice in X if every pair of elements in X has a fuzzy 
lower bound $L_\phi$ and a fuzzy upper bound $U_\phi$ (where both $L_\phi$ and $U_\phi$ are fuzzy 
subsets of X) satisfying the following two conditions: 

$$\mu_{\max\{U_\phi\}}(x) \ge \mu_A(x) \quad \forall x \in X$$

$$\mu_{\min\{L_\phi\}}(x) \ge \mu_A(x) \quad \forall x \in X$$

4.2 An Example 

Let X = {ξ1, ξ2, ξ3} be a set and P be a fuzzy partial order defined over X as 
below: 

P     ξ1    ξ2    ξ3 
ξ1    1     1     .8 
ξ2    0     1     0 
ξ3    0     .7    1 



Also let A = {^i/O, ^2/0, ^s/A} be any fuzzy subset of A. 
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In this case, we have 



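$$\begin{aligned}
U_\phi(\xi_1,\xi_2) &= \{\xi_1/1,\ \xi_2/0,\ \xi_3/0\} \\
L_\phi(\xi_1,\xi_2) &= \{\xi_1/1,\ \xi_2/1,\ \xi_3/.8\} \\
U_\phi(\xi_1,\xi_3) &= \{\xi_1/.8,\ \xi_2/0,\ \xi_3/0\} \\
L_\phi(\xi_1,\xi_3) &= \{\xi_1/1,\ \xi_2/1,\ \xi_3/1\} \\
U_\phi(\xi_2,\xi_3) &= \{\xi_1/0,\ \xi_2/0,\ \xi_3/.7\} \\
L_\phi(\xi_2,\xi_3) &= \{\xi_1/0,\ \xi_2/1,\ \xi_3/1\}
\end{aligned}$$

Therefore, 

$$\max\{U_\phi\} = \{\xi_1/1,\ \xi_2/0,\ \xi_3/.7\}$$

and 

$$\min\{L_\phi\} = \{\xi_1/0,\ \xi_2/1,\ \xi_3/.8\}$$

Hence for all $\xi \in X$ we have 

$$\mu_{\max\{U_\phi\}}(\xi) \ge \mu_A(\xi), \qquad \mu_{\min\{L_\phi\}}(\xi) \ge \mu_A(\xi).$$

Thus A is a fuzzy lattice in X. 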

But in this case, if we consider B = {^i/-5, ^ 2 / 0 , then B is not a fuzzy 

lattice in X, 
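
The bounds of the example can be recomputed with a short sketch. It assumes that the relation of the example reads P = [[1, 1, .8], [0, 1, 0], [0, .7, 1]] (rows and columns ordered ξ1, ξ2, ξ3), that fuzzy intersection and union are the pointwise minimum and maximum, and that both lattice conditions require the aggregated bound to be at least the membership in the candidate subset. The two candidate subsets at the end are hypothetical values chosen for illustration, and the function names are not from the original paper.

```python
P = [
    [1.0, 1.0, 0.8],
    [0.0, 1.0, 0.0],
    [0.0, 0.7, 1.0],
]


def dominating(P, x):
    """Dominating class of x: y -> P(y, x), i.e. column x."""
    return [P[y][x] for y in range(len(P))]


def dominated(P, x):
    """Dominated class of x: y -> P(x, y), i.e. row x."""
    return list(P[x])


def upper_bound(P, M):
    """Fuzzy upper bound: intersection (pointwise min) of dominating classes."""
    return [min(v) for v in zip(*(dominating(P, x) for x in M))]


def lower_bound(P, M):
    """Fuzzy lower bound: union (pointwise max) of dominated classes."""
    return [max(v) for v in zip(*(dominated(P, x) for x in M))]


pairs = [(0, 1), (0, 2), (1, 2)]
max_U = [max(v) for v in zip(*(upper_bound(P, p) for p in pairs))]
min_L = [min(v) for v in zip(*(lower_bound(P, p) for p in pairs))]


def is_fuzzy_lattice(A):
    """Assumed reading of the two conditions: both bounds dominate A."""
    return all(u >= a and l >= a for u, l, a in zip(max_U, min_L, A))


print(max_U, min_L)
print(is_fuzzy_lattice([0.0, 0.0, 0.4]))  # hypothetical subset
print(is_fuzzy_lattice([0.5, 0.0, 0.0]))  # hypothetical subset
```

Under these assumptions the aggregated bounds agree with the values of max{U} and min{L} stated in the example, and any subset with positive membership at ξ1 fails the second condition because min{L} vanishes there.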



4.3 A Trivial Property 

Let A be a fuzzy lattice in the fuzzy partially ordered set X and let B be any 
fuzzy subset of X such that $\mu_B(x) \le \mu_A(x)$ $\forall x \in X$. Then B is also a fuzzy 
lattice in X. 

Proof: Let any two-element non-fuzzy set in X have the fuzzy lower bound 
$L_\phi$ and the fuzzy upper bound $U_\phi$. Since A is a fuzzy lattice in X, for all 
$x \in X$ we have 

$$\mu_{\max\{U_\phi\}}(x) \ge \mu_A(x), \qquad \mu_{\min\{L_\phi\}}(x) \ge \mu_A(x).$$

Also we have here $\mu_B(x) \le \mu_A(x)$. 

Hence, for all $x \in X$, 

$$\mu_{\max\{U_\phi\}}(x) \ge \mu_B(x), \qquad \mu_{\min\{L_\phi\}}(x) \ge \mu_B(x).$$

This implies B is also a fuzzy lattice and hence the property. 
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5 Conclusion 



The most important semantics associated with the use of fuzzy sets is the ex- 
pression of proximity and representation of incomplete states of information. 
In view of the possible areas of application of lattice structure in case of soft 
computing, we studied the notion of fuzzy lattice as proposed by Nanda. Some 
redundancies in this definition have been observed and consequently a modified 
definition of fuzzy lattice has been proposed. 
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Abstract. Recently various continuous adaptive fuzzy control schemes 
have been proposed to deal with nonlinear systems with poorly under- 
stood dynamics by using parameterized fuzzy approximators. However, 
practical applications call for discrete-time adaptive fuzzy controller design 
because almost all these controllers are implemented on digital computers. 
To meet such a demand, in this paper a discrete-time adaptive 
fuzzy control scheme is developed. The strategy ensures the global stability 
of the resulting closed-loop system in the sense that all signals 
involved are uniformly bounded and the tracking error decays asymptotically. 



1 Introduction 

The application of fuzzy set theory to control problems has been the focus of 
numerous studies [1]. The motivation is often that the fuzzy set theory provides 
an alternative to the traditional modeling and design of control systems when 
system knowledge and dynamic models in the traditional sense are uncertain and 
time varying. Without any doubt, stability is the most important requirement 
for any control system. It is highly desirable to design a fuzzy controller 
which guarantees stability. Recently, some research has focused on the 
use of the Lyapunov synthesis approach to construct stable adaptive fuzzy control 
systems [2-6]. The most fundamental idea can be traced to the creative works in 
[2,3]. Namely, to deal with unknown control systems, the fuzzy model is 
considered as an approximation model to approximate unknown linear or nonlinear 
functions in the plant, where the fuzzy model is expressed as a series of 
fuzzy basis function (FBF) expansions. Ultimately, instead of using the unknown 
functions, the fuzzy model is utilized directly to construct the control inputs 
based on the Lyapunov synthesis approach. 

However, fuzzy control schemes designed in this way are usually implemented on digital 
computers, because computers carry out the control in practical applications, 
so there is a gap between the designed and the realizable control 
algorithms. With the increasing application of advanced computer technologies, 
it is much more meaningful to design control schemes in discrete time. So 
far, however, adaptive fuzzy control systems have been designed only 
for continuous-time systems, except those using the so-called T-S type fuzzy model [7]. 
In this paper, we introduce a stable discrete-time adaptive fuzzy control 
algorithm for a class of unknown sampled-data nonlinear systems. The control 
scheme considered here integrates a fuzzy control component, in which 
the FBF expansion serves as a universal approximator, with the sliding 
control component of variable structure control with a sector [8, 9]. The 
developed controller guarantees the global stability of the resulting closed-loop 
system in the sense that all signals involved are uniformly bounded and the tracking 
error decays asymptotically. 



2 System Description 

We are interested in the single-input/single-output nonlinear discrete-time 
system 

$$x(k+1) + f\bigl(x(k-n+1), x(k-n+2), \ldots, x(k)\bigr) = b\,u(k) \qquad (1)$$

where k is the current time, x is the output, and u is the input of the system; f(·) is 
a nonlinear function, with n being the system order. It is assumed that the order n is 
known but the nonlinear function f(·) is unknown. It should be noted that more 
general classes of nonlinear systems can be transformed into this structure [10]. 
The control objective is to force the state vector $X(k) = [x_1(k), x_2(k), \ldots, x_n(k)]^T 
= [x(k-n+1), x(k-n+2), \ldots, x(k)]^T$ to follow a specified desired trajectory 
$X_d(k) = [x_d(k-n+1), x_d(k-n+2), \ldots, x_d(k)]^T$. Defining the tracking error 
vector $\tilde{X}(k) = X(k) - X_d(k)$, the problem is thus to design a control u(k) which 
ensures that $\tilde{X}(k) \to 0$ as $k \to \infty$. For simplicity in this initial discussion, we 
take b = 1 in the subsequent development. 



3 Fuzzy Approximator 

We consider a fuzzy system with four principal elements: 
fuzzifier, fuzzy rule base, fuzzy inference engine, and defuzzifier. 
Let the input space $X \in R^n$ be a compact product space. Assume that there are N 
rules in the rule base, each of which has the following form: 

$R_j$: IF $x_1(k)$ is $A_1^j$ and $x_2(k)$ is $A_2^j$ and ... and $x_n(k)$ is $A_n^j$, THEN $z(k)$ is $B^j$ 

where j = 1, 2, ..., N, $x_i(k)$ (i = 1, 2, ..., n) are the input variables to the fuzzy 
system at current time k, z(k) is the output variable of the fuzzy system at 
current time k, and $A_i^j$ and $B^j$ are linguistic terms characterized by the fuzzy 
membership functions $\mu_{A_i^j}(x_i(k))$ and $\mu_{B^j}(z(k))$, respectively. 

As in [11], we consider a subset of the fuzzy systems with singleton fuzzifier, prod- 
uct inference, and Gaussian membership function. Hence, such a fuzzy system 
can be written as 

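$$h(X(k)) = \frac{\sum_{j=1}^{N} \omega_j(k) \prod_{i=1}^{n} \mu_{A_i^j}(x_i(k))}{\sum_{j=1}^{N} \prod_{i=1}^{n} \mu_{A_i^j}(x_i(k))} \qquad (2)$$

where $h: U \subset R^n \to R$, $\omega_j(k)$ is the point in R at which $\mu_{B^j}(\omega_j(k)) = 1$, named 
the connection weight; $\mu_{A_i^j}(x_i(k))$ is the Gaussian membership function, 
defined by 

$$\mu_{A_i^j}(x_i(k)) = \exp\left[-\left(\frac{x_i(k) - c_i^j(k)}{\sigma_i^j(k)}\right)^2\right] \qquad (3)$$

where $\sigma_i^j(k)$ and $c_i^j(k)$ are real-valued parameters, in which $c_i^j(k)$ indicates the 
position and $\sigma_i^j(k)$ indicates the variance of the membership function. Here, if we 
take the normalized products $\prod_{i=1}^{n} \mu_{A_i^j}(x_i(k)) \big/ \sum_{j=1}^{N} \prod_{i=1}^{n} \mu_{A_i^j}(x_i(k))$ in (2) as basis 
functions and $\omega_j(k)$ as coefficients, then h(X) 
can be viewed as a linear combination of the basis functions. Therefore, 
we give a definition regarding the basis function as follows. 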



Definition 1 Define fuzzy basis functions (FBF) as 

$$g_j(X) = \frac{\prod_{i=1}^{n} \mu_{A_i^j}(x_i(k))}{\sum_{l=1}^{N} \prod_{i=1}^{n} \mu_{A_i^l}(x_i(k))}, \quad j = 1, 2, \ldots, N \qquad (4)$$

where $\mu_{A_i^j}(x_i(k))$ is the Gaussian membership function defined in (3). 

Then, the fuzzy system (2) is equivalent to an FBF expansion 

$$h(X) = \sum_{j=1}^{N} \omega_j(k)\, g_j(X) \triangleq W^T(k)\, G(X(k)) \qquad (5)$$

where $W(k) = [\omega_1(k), \omega_2(k), \ldots, \omega_N(k)]^T$, $G(X(k)) = [g_1(X(k)), g_2(X(k)), \ldots, 
g_N(X(k))]^T$, and the notation $\triangleq$ means a definition. For convenience, throughout 
this paper, the argument (·) is sometimes omitted when no confusion is likely to 
arise. For example, we will express G(X(k)) as G(k), with k being the current time. 
We now show an important property of the FBF expansion [3]. 

Theorem 1 For any given real continuous function f on the compact set $U \in R^n$ 
and arbitrary $\varepsilon > 0$, there exists an optimal FBF expansion $h^*(X(k)) = W^{*T}(k)\, G(X(k))$ 
such that 

$$\sup_{X \in U} \left| f(X(k)) - h^*(X(k)) \right| < \varepsilon \qquad (6)$$

This theorem states that the FBF expansion (5) is a universal approximator on a 
compact set. Herein, we use the terms fuzzy universal approximator or fuzzy 
approximator to refer to the FBF expansion. Since the fuzzy universal approximator is 
characterized by the parameter vector W, the optimal $h^*$ contains an optimal 
vector $W^*$. 
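
The FBF expansion can be evaluated directly from its definition. The sketch below assumes Gaussian membership functions of the form exp(-((x - c)/σ)²) and uses invented centres, widths and weights for a system with n = 2 inputs and N = 3 rules; the function names are illustrative and none of the numbers come from the paper.

```python
import math


def gaussian(x, c, sigma):
    """Gaussian membership value with position c and width sigma."""
    return math.exp(-((x - c) / sigma) ** 2)


def fbf(X, centres, sigmas):
    """Fuzzy basis functions g_1(X), ..., g_N(X): normalised rule products."""
    products = [
        math.prod(gaussian(x, c, s) for x, c, s in zip(X, cj, sj))
        for cj, sj in zip(centres, sigmas)
    ]
    total = sum(products)
    return [p / total for p in products]


def fbf_expansion(X, weights, centres, sigmas):
    """h(X) = W^T G(X): weighted sum of the fuzzy basis functions."""
    return sum(w * g for w, g in zip(weights, fbf(X, centres, sigmas)))


# Hypothetical parameters: N = 3 rules, n = 2 inputs.
centres = [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
weights = [-2.0, 0.0, 2.0]

print(fbf([0.0, 0.0], centres, sigmas))
print(fbf_expansion([1.0, 1.0], weights, centres, sigmas))
```

Because the basis functions are normalised, they sum to one for every input, so h(X) is a convex combination of the connection weights; this is what makes the expansion linear in W and suitable for the adaptive laws used later.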



4 Adaptive Fuzzy Control System 

In this paper, we adopt the variable structure theory to construct our adaptive 
fuzzy control system. An error metric is first defined as 

$$s(k) = (q^{-1} + \lambda)^{n-1}\, \tilde{x}(k) \quad \text{with } \lambda > 0 \qquad (7)$$

where $q^{-1}$ is the delay operator, and $\lambda$ defines the bandwidth of the error dynamics 
of the system. The error metric above can be rewritten as $s(k) = \Lambda^T \tilde{X}(k)$ with 
$\Lambda^T = [\lambda^{n-1}, (n-1)\lambda^{n-2}, \ldots, 1]$. The equation s(k) = 0 defines a time-varying 
hyperplane in $R^n$ on which the tracking error vector $\tilde{X}(k)$ decays exponentially 
to zero, so that perfect tracking can be asymptotically obtained by maintaining 
this condition [13]. In this case the control objective becomes the design of a 
controller to force s(k) = 0. 
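
The error metric can be computed by expanding the operator polynomial with binomial coefficients. The sketch below assumes that the error samples are supplied newest first, so that the coefficient $\lambda^{n-1}$ multiplies the current error and the coefficient 1 multiplies the oldest sample; this pairing is an assumption about the ordering, and the function name and numbers are illustrative.

```python
from math import comb


def error_metric(err_history, lam):
    """s(k) = (q^-1 + lam)^(n-1) applied to the tracking error.

    err_history = [e(k), e(k-1), ..., e(k-n+1)], newest sample first.
    The i-th delayed sample gets the coefficient C(n-1, i) * lam^(n-1-i).
    """
    n = len(err_history)
    return sum(
        comb(n - 1, i) * lam ** (n - 1 - i) * e
        for i, e in enumerate(err_history)
    )


# Third-order example (n = 3): coefficients are lam^2, 2*lam and 1.
print(error_metric([1.0, 2.0, 4.0], 0.5))
```

For n = 3 and λ = 0.5 the coefficients are 0.25, 1 and 1, so the example evaluates 0.25·1 + 1·2 + 1·4.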

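An increment $\Delta s(k+1)$ can be written as 

$$\Delta s(k+1) = s(k+1) - s(k) = \Lambda_1^T \tilde{X}(k) + \lambda^{n-1}\, \tilde{x}(k+1) \qquad (8)$$

where $\Lambda_1^T = [(n-1)\lambda^{n-2} - \lambda^{n-1}, \ldots, 1 - (n-1)\lambda, -1]$. Substituting 
$\tilde{x}(k+1) = x(k+1) - x_d(k+1)$ into (8) and combining with (1) yields 

$$\frac{1}{\lambda^{n-1}}\, \Delta s(k+1) = u(k) - h(k) \qquad (9)$$

where h(k) denotes 

$$h(k) = f(X) + x_d(k+1) - \frac{1}{\lambda^{n-1}}\, \Lambda_1^T \tilde{X}(k) \qquad (10)$$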
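
The step from (8) to (9) can be spelled out as follows, taking b = 1 as in Section 2. The symbol $\Lambda_1$ for the coefficient vector in (8) is a reading of the damaged source and should be treated as assumed notation; the algebra itself only uses (1) and the definition of the tracking error.

```latex
\begin{align*}
\Delta s(k+1)
  &= \Lambda_1^T \tilde{X}(k) + \lambda^{n-1}\,\tilde{x}(k+1) \\
  &= \Lambda_1^T \tilde{X}(k) + \lambda^{n-1}\bigl[x(k+1) - x_d(k+1)\bigr] \\
  &= \Lambda_1^T \tilde{X}(k) + \lambda^{n-1}\bigl[u(k) - f(X) - x_d(k+1)\bigr]
  && \text{by (1) with } b = 1 \\[4pt]
\frac{1}{\lambda^{n-1}}\,\Delta s(k+1)
  &= u(k) - \Bigl[f(X) + x_d(k+1) - \frac{1}{\lambda^{n-1}}\,\Lambda_1^T \tilde{X}(k)\Bigr]
   = u(k) - h(k)
\end{align*}
```

The bracketed term collects everything that does not depend on the control input, which is why it can be treated as a single unknown function to be approximated.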

It naturally suggests that when h{X) is known, a control input of form 

u{k) = —kds{k) + h{k), kd > 0 (11) 

leads to a closed-loop system As{k) = —kds{k)^ and hence, X [k) ^ 0 as k ^ oo. 
The problem is how u(k) can be determined when h{X)^ which concludes an 
unknown functions f{X) and 6, are unknown. Therefore we have to approximate 
them to achieve control objective. Here the fuzzy approximator described in 
the previous section is used. Let us denote h*{X{k)) = W^^'^Ghik) to be the 
optimal fuzzy approximator of the unknown function h{X{k)). However, we 
have no idea to know the optimal parameter vector in the optimal fuzzy 
approximator. Generally, the estimate, denoted h{X{k)) = W^Gh{k), is adopted 
instead of the optimal fuzzy approximator /i*(X(A;)). Regarding the optimal 
fuzzy approximator, we make following assumptions. 
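Assumption 1 The optimal fuzzy approximator error, $d_h(k) = h^*(X(k)) - h(X(k))$, 
satisfies the following inequality, where $\varepsilon_h^*$ is a known small positive value. 

$\sup_{X \in U} |d_h(k)| < \varepsilon_h^*$ (12) 

Assumption 2 In the defined subspaces $C_{w_h} \ni W_h^*$ and 
$C_{\varepsilon_h} \ni \varepsilon_h^*$, there are some positive constants $c_{w_h}$, 
$c_{\varepsilon_h}$ that satisfy the following inequalities. 

$\|\hat{W}_h(k) - W_h^*\| \le c_{w_h}$, $\forall \hat{W}_h(k) \in C_{w_h}$ (13) 

$|\hat{\varepsilon}_h(k) - \varepsilon_h^*| \le c_{\varepsilon_h}$, $\forall \hat{\varepsilon}_h(k) \in C_{\varepsilon_h}$ (14) 

where $\hat{\varepsilon}_h(k)$ is the estimate of $\varepsilon_h^*$, just as $\hat{W}_h(k)$ 
is the estimate of $W_h^*$. 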





Remarks: 

1. Based on Theorem 1, Assumption 1 is reasonable. 

2. Assumption 2 may seem difficult to satisfy. However, in each defined 
subspace that contains the optimal parameter (or vector), such an 
inequality is easy to check. The problem is therefore how to force each 
estimate to enter its defined subspace. In the algorithm below, a 
projection algorithm is adopted to ensure that each estimate stays within its 
defined subspace. 

Inspired by the control structure in (11), and using the fuzzy approximator 
$\hat{h}(X)$, our adaptive control law is described below: 

u(^) — H“ u^^(A:) (^^) 



where $u_{fu}(k)$, expressed by 

$u_{fu}(k) = \hat{h}(k) - \hat{\varepsilon}_h(k)\,\mathrm{sgn}(s(k))$ 

$= \hat{W}_h^T(k) G_h(k) - \hat{\varepsilon}_h(k)\,\mathrm{sgn}(s(k))$ (16) 

is the fuzzy component of the control law, which attempts to recover or cancel the 
unknown function $h$. The adaptive component is synthesized by 

$\hat{W}_h(k) = \wp[\hat{W}_h(k-1) - \Gamma_h G_h(k) s(k)]$ (17) 

$\hat{\varepsilon}_h(k) = \wp\{\hat{\varepsilon}_h(k-1) + \gamma_h |s(k)|\}$ (18) 

where $\hat{W}_h$ and $\hat{\varepsilon}_h$ are the estimates of $W_h^*$ and 
$\varepsilon_h^*$, respectively; $\wp$ represents the projection operator necessary to 
ensure that $\hat{W}_h(k) \in C_{w_h}$ and $\hat{\varepsilon}_h(k) \in C_{\varepsilon_h}$ 
for all $k$ [14]; $\Gamma_h$ and $\gamma_h > 0$ determine the rates of adaptation, where 
$\Gamma_h$ is a symmetric positive definite matrix selected to satisfy the following 
inequality: 

2 



Glik)rhGH{k) < 



(19) 
It should be noted that the term $\hat{\varepsilon}_h(k)\,\mathrm{sgn}(s(k))$ in (16) 
actually reflects the component that compensates for the approximation error 
$(h(k) - \hat{h}(k))$. If the fuzzy approximator provides a good description of the 
unknown function $h(k)$, then the term $\hat{\varepsilon}_h(k)$ should be small, as 
should the optimal fuzzy approximator error $d_h(k)$. Conversely, if the fuzzy 
approximator is poor, $\hat{\varepsilon}_h(k)$ will rise automatically to the necessary 
level, ensuring the stability of the overall system. 

$u_{vu}(k)$ is the variable structure control component, which is expressed by 



Uvu{k) = (ihiCwA\Gh{k)\\ (20) 

The variable structure control component only works outside of the sector 
$\Omega(k)$ [8] defined by 





n{k) = Qhiik) U Qhi{k) 


(21) 


nhiik) 


= {«(*)! ll|G'ft(A:)||s(A;)|c^^ 


(22) 


fihiik) = {s(A:)| |s(fc)|ce^ < Phi} 


(23) 


Phi 




(24) 






(25) 




A{k) = C^AGh(k)\\+Ce, 


(26) 



where ^ is a positive constant chosen to guarantee the system robustness. And 
the switching type coefficients in (20) are determined as follows. 



Phi = 



Ph2 = 





for 




||Gft(A:)||s(A:) < -phi 


0, 


for 


Cwh 


|||Gft(fc)||s(fc)| < Phi 


< 


for 




||Gft(A:)j|s(A:) > phi 


\ 




for 


CefcS(fc) < -Phi 




0, 


for 


Ce^|s(A:)| < Phi 


\ 




for Ce^s(fc) > Phi 



(27) 



(28) 



Remarks: 

1. In contrast with the continuous-time adaptive fuzzy control system [6], the 
variable structure control component $u_{vu}(k)$ corresponds to the sliding 
component $u_{su}(t)$. In [6], via a modulation $m(t)$, $u_{su}(t)$ only works in 
the specified region where the fuzzy approximator cannot effectively 
approximate the unknown function. Similarly, here $u_{vu}(k)$ only 
works outside of the sector $\Omega(k)$, which is defined on the tracking error 
metric $s(k)$ via the switching type coefficients (27), (28). 

2. Compared with the other adaptive schemes given in [8], [9], there is an important 
difference. In [8], [9], the convergence of the tracking error metric depends on 
an assumption like $\sup d_h(k) < d$, where $d$ is a known constant, which may 
not be easy to check, because it could not ensure that Cjj is bounded. On 
the other hand, in our scheme, by using a projection algorithm we can easily 
realize W{k) C so that ||W(/?)|| < This will enhance the flexibility of 
the system. 

The stability of the closed-loop system described by (1) and (15)-(28) is established 
in the following theorem. 

Theorem 2 Under Assumptions 1 and 2, if the plant (1) is controlled by 
(15), (16), (20), (27)-(28), and the adaptive component is synthesized by (17)-(18), 
then the tracking error metric of the system will stably enter the sector defined 
by (21)-(26). When the system tracking error metric is driven inside the sector, 
$|s(k+1)|$ is usually of small magnitude, such that it can be assumed that 
$|\Delta s(k+1)| < ((\gamma_h - k_d)\lambda^{n-1})^{1/2}|s(k)|$ with $0 < k_d < \gamma_h$. 
Then $\tilde{X}(k) \to 0$ as $k \to \infty$ and all signals are bounded. 

We do not have enough room to give the stability proof or to present the simulation 
results which confirm the correctness of the proposed control algorithm. 



5 Conclusion 

In this paper, we proposed a discrete-time adaptive fuzzy control scheme for 
a class of unknown sampled-data nonlinear systems. The developed controller 
guarantees the global stability of the resulting closed-loop system in the sense 
that all signals involved are uniformly bounded and the tracking error decays 
asymptotically. 
Conditional Probability Relations in Fuzzy 
Relational Database 



Rolly Intan and Masao Mukaidono 
Meiji University, Kanagawa-ken, Japan 



Abstract. In 1982, Buckles and Petry [1] proposed the fuzzy relational 
database for incorporating non-ideal or fuzzy information in a relational 
database. The fuzzy relational database relies on the specification of a 
similarity relation [8] in order to distinguish each scalar domain in the fuzzy 
database. These relations are reflexive, symmetric, and max-min transitive. 
In 1989, Shenoi and Melton extended the fuzzy relational database 
model of Buckles and Petry to deal with proximity relations [2] for scalar 
domains. Since reflexivity and symmetry are the only constraints placed 
on proximity relations, a proximity relation is considered a generalization 
of a similarity relation. 

In this paper, we propose a design of fuzzy relational database that deals 
with conditional probability relations for scalar domains. These relations 
are reflexive and not symmetric. We show that the relation between pieces of 
fuzzy information is naturally not symmetric. In addition, we define a notion of 
redundancy which generalizes redundancy in classical relational database. 
We also discuss partitioning of domains with the objective of developing 
equivalence classes. 



1 Introduction 

The fuzzy relational database proposed by Buckles and Petry (1982) [1], as in 
classical relational database theory, consists of a set of tuples, where $t_i$ 
represents the i-th tuple and, if there are m domains D, then 
$t_i = (d_{i1}, d_{i2}, \ldots, d_{im})$. In the classical relational database, each 
component $d_{ij}$ of a tuple is an atomic crisp value of tuple $t_i$ restricted to 
the domain $D_j$, where $d_{ij} \in D_j$. However, in fuzzy relational database, each 
component $d_{ij}$ is not limited to an atomic crisp value; instead, 
$d_{ij} \subseteq D_j$ ($d_{ij} \ne \emptyset$), as defined in the following definition. 

Definition 1. [1]. A fuzzy database relation, R, is a subset of the set of 
cross product $2^{D_1} \times 2^{D_2} \times \cdots \times 2^{D_m}$, where $2^{D_j}$ 
denotes the power set of $D_j$ excluding the empty set $\emptyset$. 

A fuzzy tuple is a member of a fuzzy database relation as follows. 

Definition 2. [1]. Let $R \subseteq 2^{D_1} \times 2^{D_2} \times \cdots \times 2^{D_m}$ be 
a fuzzy database relation. A fuzzy tuple t (with respect to R) is an element of R. 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 251-260, 2001. 
Even though the fuzzy relational database considers components of tuples 
as sets of data values from the corresponding domains, by applying the concept 
of equivalence classes, it is possible to define a notion of redundancy which is 
similar to that of classical relational database theory. The starting point is the 
definition of an interpretation of a fuzzy tuple, as shown in the following definition. 

Definition 3. [1]. Let $t_i = (d_{i1}, d_{i2}, \ldots, d_{im})$ be a fuzzy tuple. An 
interpretation of $t_i$ is a tuple $\theta = (a_1, a_2, \ldots, a_m)$ where 
$a_j \in d_{ij}$ for each domain $D_j$. 

It is important to note that each component of an interpretation is an atomic 
value. In classical relational database, each tuple is the same as its interpretation 
because every component of a tuple is an atomic crisp value. The process of 
determining redundancy in terms of interpretations of fuzzy tuples is defined 
as follows. 

Definition 4. [1]. Two tuples ti and tj are redundant if and only if they possess 
an identical interpretation. 

The definition of redundancy above is a generalization of the concept of redun- 
dancy in classical database theory. Specifically, the absence of redundant tuples 
in classical relational database means there are no multiple occurrences of the 
same interpretations. Similarly, in fuzzy relational database, there should be no 
more than one tuple with a given interpretation. 

The fuzzy relational database of Buckles and Petry relies on the specification 
of similarity relation for each distinct scalar domain in the fuzzy database. A 
similarity relation, Sj , for a given domain, Dj , maps each pair of elements in the 
domain to an element in the closed interval [0,1]. 

Definition 5. [8] A similarity relation is a mapping, $s_j : D_j \times D_j \to [0, 1]$, 
such that for $x, y, z \in D_j$, 

$s_j(x, x) = 1$ (reflexivity), 

$s_j(x, y) = s_j(y, x)$ (symmetry), 

$s_j(x, z) \ge \max\{\min[s_j(x, y), s_j(y, z)]\}$ (max-min transitivity). 



Just as the similarity relation is used in the fuzzy relational database of Buckles 
and Petry, a specific type of similarity relation known as the identity relation is 
used in classical relational database, as defined in the following definition. 

Definition 6. An identity relation is a mapping, $s_j : D_j \times D_j \to \{0, 1\}$, such 
that for $x, y \in D_j$, 

$s_j(x, y) = 1$ if $x = y$, and $s_j(x, y) = 0$ otherwise. 



There is considerable criticism of the use of similarity relations, especially 
of max-min transitivity, in fuzzy relational database (see e.g., [7]). 
A simple example illustrates this criticism. 
Let us suppose that one is similar to two to a level of 0.8, and two is similar to 
three to a level of 0.8. According to max-min transitivity, the similarity between 
one and three must be no less than 0.8. Therefore max-min transitivity is considered 
a very restrictive constraint. For this reason, in 1989, Shenoi and 
Melton extended the fuzzy relational database model of Buckles and Petry to deal 
with proximity relations for scalar domains, as defined in the following definition. 

Definition 7. [2] A proximity relation is a mapping, $s_j : D_j \times D_j \to [0, 1]$, 
such that for $x, y \in D_j$, 

$s_j(x, x) = 1$ (reflexivity), 

$s_j(x, y) = s_j(y, x)$ (symmetry). 
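
The difference between a similarity relation (Definition 5) and a proximity relation (Definition 7) can be checked mechanically on a finite domain. The following Python sketch is only an illustration: it assumes the relation is stored as a square matrix `s` with `s[i][j]` holding the level between the i-th and j-th elements, and the function names and the sample matrices are hypothetical.

```python
def is_reflexive(s):
    """s(x, x) = 1 for every element."""
    return all(s[i][i] == 1.0 for i in range(len(s)))


def is_symmetric(s):
    """s(x, y) = s(y, x) for every pair."""
    n = len(s)
    return all(s[i][j] == s[j][i] for i in range(n) for j in range(n))


def is_max_min_transitive(s):
    """s(x, z) >= max over y of min(s(x, y), s(y, z))."""
    n = len(s)
    return all(
        s[i][k] >= max(min(s[i][j], s[j][k]) for j in range(n))
        for i in range(n)
        for k in range(n)
    )


# Domain (one, two, three); one~two and two~three at level 0.8.
# Illustrative value: one~three is only 0.5.
proximity_only = [
    [1.0, 0.8, 0.5],
    [0.8, 1.0, 0.8],
    [0.5, 0.8, 1.0],
]
# Max-min transitivity forces one~three up to at least 0.8.
similarity = [
    [1.0, 0.8, 0.8],
    [0.8, 1.0, 0.8],
    [0.8, 0.8, 1.0],
]
```

The first matrix satisfies Definition 7 but not Definition 5, which is exactly the extra freedom that proximity relations allow.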

In another paper [6], Shenoi and Melton introduced a notion of α-redundant 
tuples, as defined in the following definition. 

Definition 8. Two tuples, $t_i$ and $t_j$, are α-redundant, denoted by 
$t_i \sim_\alpha t_j$, where $\alpha = (\alpha_1, \ldots, \alpha_m)$, whenever 

$d_{i1} \sim_{\alpha_1} d_{j1}, \ldots, d_{im} \sim_{\alpha_m} d_{jm}$. 

In general, α is a subset of levels associated with $t_i$ and $t_j$. 

However, every word naturally has a different range of meaning, and some words 
have a more general meaning than others. As a simple example, 
when talking about hair color, the word ‘Brown’ is more general and broader, while 
‘Light Brown’ is narrower and more specific. The word ‘Brown’ covers a 
wider range of meaning in color than the word ‘Light Brown’, so the range of 
meaning in color differs between these two words. It is correct 
to say that ‘Light Brown is Brown’, but not ‘Brown is Light Brown’. Moreover, 
we can say that the similarity level of ‘Brown’ given ‘Light Brown’ and the similarity 
level of ‘Light Brown’ given ‘Brown’ are different. Therefore, we consider that the 
relation of similarity between two scalar domains should be non-symmetric rather 
than symmetric. For this reason, in this paper, we propose a design of 
fuzzy relational database that deals with conditional probability relations for scalar 
domains. These relations are reflexive and not symmetric. In addition, related 
to this method, we define a notion of redundancy which generalizes redundancy 
in classical relational database and even the concept of redundancy 
proposed by Shenoi and Melton [6]. We also discuss partitioning of domains 
with the objective of developing equivalence classes by using not only the α-level to 
determine the degree of non-ideality, but also r as reference data for clustering. 

2 Fuzzy Relational Database with Conditional 
Probability Relation 

In this section, we propose conditional probability relation as a basis of deal- 
ing with scalar domains in fuzzy relational database. A conditional probability 
relation is defined in the following definition. 
Definition 9. A conditional probability relation is a mapping, 
$s_j : D_j \times D_j \to [0, 1]$, such that for $x, y \in D_j$, 

$s_j(x|y) = \frac{|x \cap y|}{|y|}$ (1) 

where $s_j(x|y)$ means the similarity level of x given y. 

Considering our example in Section 1, it is possible to set 
$s_{hair}(Brown|Light Brown) = 0.9$, 
$s_{hair}(Light Brown|Brown) = 0.4$. 

These expressions mean that the similarity level of ‘Brown’ given ‘Light Brown’ is 0.9 
and the similarity level of ‘Light Brown’ given ‘Brown’ is 0.4, respectively. These 
conditions can be easily understood by imagining every scalar domain represented 
as a set. In our example, the size of set ‘Brown’ is bigger than the size of set 
‘Light Brown’, as shown in the following figure. 



[Figure: two overlapping sets labelled ‘Light Brown’ and ‘Brown’] 

Fig. 1. Intersection: ‘Brown’ and ‘Light Brown’ 



Area A, which is the intersection area between ‘Brown’ and ‘Light Brown’, 
represents their similarity area. We can calculate the similarity level of ‘Brown’ 
given ‘Light Brown’ and the similarity level of ‘Light Brown’ given ‘Brown’, 
denoted by $s_{hair}(Brown|Light Brown)$ and $s_{hair}(Light Brown|Brown)$, 
respectively, using (1) as follows. 

$s_{hair}(Brown|Light Brown) = \frac{|A|}{|Light Brown|}$, 

$s_{hair}(Light Brown|Brown) = \frac{|A|}{|Brown|}$, 

where $|Brown|$ and $|Light Brown|$ represent the size of set ‘Brown’ and ‘Light Brown’, 
respectively. Corresponding to Figure 1, $|Brown| > |Light Brown|$ implies 

$s_{hair}(Brown|Light Brown) > s_{hair}(Light Brown|Brown)$. 

Furthermore, ideal information (crisp data) and non-ideal information (fuzzy 
information) can both be represented by using fuzzy sets. In that case, a fuzzy set 
can be used as a connector to represent imprecise data from total ignorance (the 
most imprecise data) to crisp (the most precise data) as follows. 

Definition 10. [3] Let U be a universal set, where $U = \{u_1, u_2, \ldots, u_n\}$. Total 
ignorance (TI) over U and crisp of $u_i \in U$ are defined as 

TI over U = $\{1/u_1, \ldots, 1/u_n\}$, 

Crisp($u_i$) = $\{0/u_1, \ldots, 0/u_{i-1}, 1/u_i, 0/u_{i+1}, \ldots, 0/u_n\}$, 

respectively. 

The similarity level of two pieces of fuzzy information represented by two fuzzy 
sets can be approximately calculated by using the conditional probability of two 
fuzzy sets [3]. In that case, $|y| = \sum_{u \in U} \chi^y_u$, where $\chi^y_u$ is the 
membership function of y over u, and intersection is defined as minimum. 

Definition 11. Let $x = \{\chi^x_1/u_1, \ldots, \chi^x_n/u_n\}$ and 
$y = \{\chi^y_1/u_1, \ldots, \chi^y_n/u_n\}$ be two fuzzy sets over 
$U = \{u_1, u_2, \ldots, u_n\}$. $s_j : D_j \times D_j \to [0, 1]$, such that for 
$x, y \in D_j$, 

$s_j(x|y) = \frac{\sum_{i=1}^{n} \min\{\chi^x_i, \chi^y_i\}}{\sum_{i=1}^{n} \chi^y_i}$ 

where $s_j(x|y)$ means the similarity level of x given y. 

Example 1. Let us suppose that two fuzzy sets, ‘Warm’ and ‘Rather Hot’, which 
represent conditions of temperature, are arbitrarily given by the following 
membership functions. 

Warm = {0.2/22, 0.5/24, 1/26, 1/28, 0.5/30, 0.2/32} 

RatherHot = {0.5/30, 1/32, 1/34, 0.5/36} 

By Definition 11, we calculate the similarity level of ‘Warm’ given ‘Rather Hot’ and 
the similarity level of ‘Rather Hot’ given ‘Warm’, respectively, as follows. 

$s_{temperature}(Warm|RatherHot) = \frac{\min(0.5, 0.5) + \min(0.2, 1)}{0.5 + 1 + 1 + 0.5} = \frac{0.7}{3}$, 

$s_{temperature}(RatherHot|Warm) = \frac{\min(0.5, 0.5) + \min(0.2, 1)}{0.2 + 0.5 + 1 + 1 + 0.5 + 0.2} = \frac{0.7}{3.4}$. 
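
The calculation in Example 1 can be reproduced with a short Python sketch of Definition 11. It assumes that a fuzzy set is stored as a dictionary from universe elements to membership grades, with absent elements having grade 0, and that the conditioning set is non-empty; the names `cardinality` and `similarity` are illustrative.

```python
def cardinality(fuzzy_set):
    """|y| = sum of the membership grades of y."""
    return sum(fuzzy_set.values())


def similarity(x, y):
    """s(x|y) = sum_u min(x(u), y(u)) / |y|, with intersection taken as minimum."""
    overlap = sum(min(grade, y[u]) for u, grade in x.items() if u in y)
    return overlap / cardinality(y)


warm = {22: 0.2, 24: 0.5, 26: 1.0, 28: 1.0, 30: 0.5, 32: 0.2}
rather_hot = {30: 0.5, 32: 1.0, 34: 1.0, 36: 0.5}

print(similarity(warm, rather_hot))   # 0.7 / 3
print(similarity(rather_hot, warm))   # 0.7 / 3.4
```

The two directions share the same numerator and differ only in the denominator, which is why the relation is not symmetric whenever the two sets have different cardinalities.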



Calculation of the similarity level based on the conditional probability relation 
implies the conditions in the following theorem. 

Theorem 1. Let $s_j(x|y)$ be the similarity level of x given y and $s_j(y|x)$ be the 
similarity level of y given x, such that for $x, y, z \in D_j$, 

1. if $s_j(x|y) = s_j(y|x) = 1$ then $x = y$, 
2. if $s_j(y|x) = 1$ and $s_j(x|y) < 1$ then $x \subset y$, 
3. if $s_j(x|y) = s_j(y|x) > 0$ then $|x| = |y|$, 
4. if $s_j(x|y) < s_j(y|x)$ then $|x| < |y|$, 
5. if $s_j(x|y) > 0$ then $s_j(y|x) > 0$, 
6. if $s_j(x|y) \ge s_j(y|x) > 0$ and $s_j(y|z) \ge s_j(z|y) > 0$ then $s_j(x|z) \ge s_j(z|x)$. 
The conditional probability relation in Definition 9 is defined through a model 
or an interpretation based on conditional probability. From Theorem 1, we can 
define two other interesting mathematical relations, the resemblance relation and 
the conditional transitivity relation, based on the constraints that the theorem 
places on similarity levels, taken as axioms, although we do not clarify 
their properties in this paper. 

Definition 12. A resemblance relation is a mapping, $s_j : D_j \times D_j \to [0, 1]$, 
such that for $x, y \in D_j$, 

$s_j(x, x) = 1$ (reflexivity), 
if $s_j(x, y) > 0$ then $s_j(y, x) > 0$. 

Definition 13. A conditional transitivity relation is a mapping, 
$s_j : D_j \times D_j \to [0, 1]$, such that for $x, y, z \in D_j$, 

$s_j(x, x) = 1$ (reflexivity), 
if $s_j(x, y) > 0$ then $s_j(y, x) > 0$, 
if $s_j(x, y) \ge s_j(y, x) > 0$ and $s_j(y, z) \ge s_j(z, y) > 0$ then $s_j(x, z) \ge s_j(z, x)$. 
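
As a rough illustration of Definitions 12 and 13, the following Python sketch tests both sets of axioms on a finite relation matrix. It assumes that `s[i][j]` stores the level of the i-th element given the j-th, and that the comparisons in the third condition of Definition 13 are non-strict; the function names and the sample data are hypothetical.

```python
def conditional_probability_relation(sets):
    """s[i][j] = |x_i & x_j| / |x_j| for non-empty crisp sets (Definition 9)."""
    return [[len(xi & xj) / len(xj) for xj in sets] for xi in sets]


def is_resemblance(s):
    """Reflexive, and s(x, y) > 0 implies s(y, x) > 0."""
    n = len(s)
    reflexive = all(s[i][i] == 1.0 for i in range(n))
    same_support = all(
        (s[i][j] > 0) == (s[j][i] > 0) for i in range(n) for j in range(n)
    )
    return reflexive and same_support


def is_conditionally_transitive(s):
    """Resemblance axioms plus the conditional transitivity condition."""
    if not is_resemblance(s):
        return False
    n = len(s)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                premise = s[i][j] >= s[j][i] > 0 and s[j][k] >= s[k][j] > 0
                if premise and not s[i][k] >= s[k][i]:
                    return False
    return True


# A relation induced by crisp sets, as in Definition 9 (illustrative sets).
from_sets = conditional_probability_relation([{1, 2, 3, 4}, {3, 4, 5}, {4, 5}])
# A hand-made resemblance relation that violates conditional transitivity.
counterexample = [
    [1.0, 0.5, 0.2],
    [0.4, 1.0, 0.5],
    [0.4, 0.3, 1.0],
]
```

The relation built from crisp sets passes both checks, in line with Theorem 1, while the hand-made matrix is a resemblance relation only.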

Since a relation which satisfies the conditions of a conditional transitivity relation 
must also satisfy those of a resemblance relation, the resemblance relation is clearly 
more general than the conditional transitivity relation. On the other hand, the 
conditional transitivity relation generalizes the proximity relation, because a 
proximity relation is just a specific case of a conditional transitivity relation. 
From Theorem 1, we also conclude that the conditional probability relation defined in 
Definition 9 is a specific example of a conditional transitivity relation. The 
generalization levels of the identity relation, similarity relation, proximity 
relation, conditional probability relation, conditional transitivity relation and 
resemblance relation are shown in Figure 2. 



[Figure: nested diagram with the labels Resemblance Relation, Conditional 
Transitivity Relation, Conditional Probability Relation, Proximity Relation, 
Similarity Relation, Identity Relation] 

Fig. 2. Generalization level 
3 Redundancy in Tuples 

In this section, we discuss and define a notion of redundancy which generalizes 
redundancy in classical relational database and even the concept of redundancy 
in fuzzy relational database proposed by Shenoi and Melton [6]. 

Based on Theorem 1, we define a notion of redundant tuples as follows. 

Definition 14. Tuple $t_i = (d_{i1}, d_{i2}, \ldots, d_{im})$ is α-redundant in relation R 
if there is a tuple $t_j = (d_{j1}, d_{j2}, \ldots, d_{jm})$ which covers all information 
of $t_i$ with the degree of non-ideality α, where $\alpha = (\alpha_1, \ldots, \alpha_m)$, 
whenever 

$\forall x \in d_{ik}, \exists y \in d_{jk}, s_k(y|x) \ge \alpha_k$ for $k = 1, 2, \ldots, m$. 

In classical relational database, all elements of a scalar domain are atomic values 
and distinct elements are disjoint. It is clear that the identity relation is 
used for the treatment of ideal information, where a domain element has 
no similarity to any other element of the domain; each element is similar 
only to itself. Consequently, a tuple is redundant if it is exactly the same 
as another tuple. However, in fuzzy relational database, a domain element may 
have a similarity level to any other element of the domain. Moreover, considering 
the range of meaning, a piece of fuzzy information may cover another piece of fuzzy 
information (e.g., ‘Brown’ covers ‘Light Brown’) with a certain degree of non-ideality. 
Therefore, we define the concept of a redundant tuple in fuzzy relational database 
as in Definition 14, where components of tuples need not be single values, 
as proposed in the fuzzy relational database model of Buckles and Petry. Compared 
to Definitions 4 and 8, which also define redundant tuples, Definition 14 
appears to be more general, because symmetry is just a special case 
in the conditional probability relation. 

Example 2. Let us consider the relation scheme ARTIST (NAME, AGE, APTITUDE). 
An instance of the fuzzy relation is given in Table 1, where each tuple represents 
someone’s opinion about the artist named in the tuple. Domain 
NAME is unique, which means that all tuples with the same name refer to 
the same person. 



Table 1. ARTIST Relation 

NAME(N)   AGE(A)       APTITUDE(AP) 
John      Young        Good 
Tom       Young        {Average, Good} 
David     Middle Age   Very Good 
Tom       [20,25]      Average 
David     About-50     Outstanding 
Table 2. Similarity Level of AGE 

             Young   [20,25]   Middle Age   About-50 
Young        1.0     0.8       0.3          0.0 
[20,25]      0.4     1.0       0.2          0.0 
Middle Age   0.3     0.4       1.0          0.9 
About-50     0.0     0.0       0.5          1.0 

Table 3. Similarity Level of APTITUDE 

              Average   Good   Very Good   Outstanding 
Average       1.0       0.6    0.3         0.0 
Good          0.6       1.0    0.8         0.6 
Very Good     0.15      0.4    1.0         0.9 
Outstanding   0.0       0.3    0.9         1.0 

Let us suppose that the similarity levels of scalar domains AGE and APTITUDE 
are given in Tables 2 and 3. 

From Table 2, the similarity level of Young given [20,25], $s_A(Young|[20,25])$, 
is equal to 0.8; on the other hand, the similarity level of [20,25] given Young, 
$s_A([20,25]|Young)$, is equal to 0.4. In that case, Young covers a wider range 
of meaning in AGE than [20,25]. Now, we want to remove redundant tuples in 
Table 1 with arbitrary α = (1.0, 0.7, 0.8), which corresponds to N, A, and AP, 
where $\alpha_N = 1.0$, $\alpha_A = 0.7$, $\alpha_{AP} = 0.8$. We must set α to 1.0 for 
domain NAME in particular, because NAME is a crisp domain and each distinct 
value indicates a different person. 

By applying the formula in Definition 14, there are two redundant tuples, 
(Tom, [20,25], Average) and (David, About-50, Outstanding), which are covered 
by (Tom, Young, {Average, Good}) and (David, Middle Age, Very Good), 
respectively. Table 4 shows the final result after removing the two redundant 
tuples. 



Table 4. ARTIST Relation (free of redundancy) 

NAME(N)   AGE(A)       APTITUDE(AP) 
John      Young        Good 
Tom       Young        {Average, Good} 
David     Middle Age   Very Good 
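
The redundancy test of Definition 14 applied in Example 2 can be sketched in Python as follows. The sketch assumes that each entry of Tables 2 and 3 is read as s(row|column), which agrees with the value $s_A(Young|[20,25]) = 0.8$ quoted in the text, and that NAME uses the identity relation; all function and variable names are illustrative.

```python
# AGE[y][x] = s(y|x): row label given column label, as read from Table 2.
AGE = {
    "Young":      {"Young": 1.0, "[20,25]": 0.8, "Middle Age": 0.3, "About-50": 0.0},
    "[20,25]":    {"Young": 0.4, "[20,25]": 1.0, "Middle Age": 0.2, "About-50": 0.0},
    "Middle Age": {"Young": 0.3, "[20,25]": 0.4, "Middle Age": 1.0, "About-50": 0.9},
    "About-50":   {"Young": 0.0, "[20,25]": 0.0, "Middle Age": 0.5, "About-50": 1.0},
}
# APTITUDE[y][x] = s(y|x), as read from Table 3.
APTITUDE = {
    "Average":     {"Average": 1.0, "Good": 0.6, "Very Good": 0.3, "Outstanding": 0.0},
    "Good":        {"Average": 0.6, "Good": 1.0, "Very Good": 0.8, "Outstanding": 0.6},
    "Very Good":   {"Average": 0.15, "Good": 0.4, "Very Good": 1.0, "Outstanding": 0.9},
    "Outstanding": {"Average": 0.0, "Good": 0.3, "Very Good": 0.9, "Outstanding": 1.0},
}

# One relation per domain, each called as s(y, x) = s(y|x).
RELATIONS = (
    lambda y, x: 1.0 if y == x else 0.0,   # NAME: identity relation
    lambda y, x: AGE[y][x],
    lambda y, x: APTITUDE[y][x],
)


def covers(tj, ti, alpha):
    """True if, in every domain k, each x in d_ik has some y in d_jk with s_k(y|x) >= alpha_k."""
    return all(
        all(any(s(y, x) >= a for y in dj) for x in di)
        for dj, di, s, a in zip(tj, ti, RELATIONS, alpha)
    )


def redundant_tuples(relation, alpha):
    """Tuples covered by some other tuple; two tuples covering each other are both listed."""
    return [
        ti for i, ti in enumerate(relation)
        if any(j != i and covers(tj, ti, alpha) for j, tj in enumerate(relation))
    ]


ARTIST = [
    ({"John"}, {"Young"}, {"Good"}),
    ({"Tom"}, {"Young"}, {"Average", "Good"}),
    ({"David"}, {"Middle Age"}, {"Very Good"}),
    ({"Tom"}, {"[20,25]"}, {"Average"}),
    ({"David"}, {"About-50"}, {"Outstanding"}),
]
ALPHA = (1.0, 0.7, 0.8)
REDUNDANT = redundant_tuples(ARTIST, ALPHA)
```

With these inputs the sketch flags the tuples for Tom with [20,25] and David with About-50. The covering test is directional: (Tom, Young, {Average, Good}) is not covered by (Tom, [20,25], Average), because the level of [20,25] given Young is only 0.4.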
4 Clustering in Scalar Domains 

In the fuzzy relational database model of conditional probability relations, two 
parameters, the α-cut and r, are required to partition scalar 
domains into equivalence classes or disjoint clusters. Here, α corresponds 
to the degree of non-ideality and r is the reference datum against which the other 
elements are compared in the process of partitioning a scalar domain. 

Definition 15. If $s : D \times D \to [0, 1]$ is a conditional probability relation, then 
the equivalence class (disjoint partition) on D with α-cut and reference r is 
denoted by $E_\alpha^r$ and is given by 

$E_\alpha^r = \{x \in D \mid s(x|r) \ge \alpha\}$. 

Example 3. A conditional probability relation for the domain Hair Color is 
shown in Table 5. 



Table 5. Similarity Level of Hair Color 

                  Bk    DB    A     R     LB    Bd    Bc 
Black(Bk)         1.0   0.8   0.7   0.5   0.2   0.0   0.0 
Dark Brown(DB)    0.8   1.0   0.8   0.6   0.4   0.2   0.0 
Auburn(A)         0.7   0.8   1.0   0.9   0.6   0.4   0.0 
Red(R)            0.5   0.7   0.9   1.0   0.8   0.6   0.3 
Light Brown(LB)   0.1   0.2   0.3   0.4   1.0   0.8   0.6 
Blond(Bd)         0.0   0.1   0.2   0.3   0.8   1.0   0.9 
Bleached(Bc)      0.0   0.0   0.0   0.1   0.4   0.6   1.0 



Let us suppose that we want to cluster the data with r = Red; this gives rise 
to the following clusters for various values of α: 

α ∈ (0.7, 1.0] : {Auburn, Red} 

α ∈ (0.4, 0.7] : {Dark Brown, Black} 

α ∈ [0.0, 0.4] : {Blond, Bleached, Light Brown} 
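
The clustering of Example 3 can be reproduced with the following Python sketch. It assumes that each entry of Table 5 is read as s(row|column), so that the levels given r = Red come from the column labelled R, and that each element is placed in the α band containing its level; the function `cluster_by_reference` and the `cuts` parameter are illustrative names.

```python
HAIR = ["Black", "Dark Brown", "Auburn", "Red", "Light Brown", "Blond", "Bleached"]

# S[i][j] = s(HAIR[i] | HAIR[j]), as read from Table 5.
S = [
    [1.0, 0.8, 0.7, 0.5, 0.2, 0.0, 0.0],
    [0.8, 1.0, 0.8, 0.6, 0.4, 0.2, 0.0],
    [0.7, 0.8, 1.0, 0.9, 0.6, 0.4, 0.0],
    [0.5, 0.7, 0.9, 1.0, 0.8, 0.6, 0.3],
    [0.1, 0.2, 0.3, 0.4, 1.0, 0.8, 0.6],
    [0.0, 0.1, 0.2, 0.3, 0.8, 1.0, 0.9],
    [0.0, 0.0, 0.0, 0.1, 0.4, 0.6, 1.0],
]


def cluster_by_reference(names, s, reference, cuts):
    """Group names by the band of s(x|reference).

    cuts = [0.7, 0.4] gives the bands (0.7, 1], (0.4, 0.7] and [0, 0.4].
    """
    col = names.index(reference)
    bands = [[] for _ in range(len(cuts) + 1)]
    for i, name in enumerate(names):
        level = s[i][col]
        # The band index is the number of cut points the level does not exceed.
        band = sum(1 for c in cuts if level <= c)
        bands[band].append(name)
    return bands


CLUSTERS = cluster_by_reference(HAIR, S, "Red", [0.7, 0.4])
```

Choosing a different reference r changes the column that is read and therefore the clusters, which is the role the reference plays alongside the α-cut.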

5 Conclusions 

In this paper, we extended the fuzzy relational database to deal with conditional 
probability relations for scalar domains. These relations are reflexive and not 
symmetric. We showed that the similarity level of two pieces of fuzzy or non-ideal 
information is naturally not symmetric. The similarity level of two pieces of fuzzy 
information may be symmetric if and only if they have the same range of meaning. 
Therefore, the conditional probability relation generalizes the proximity relation 
proposed by Shenoi and Melton in 1989. Moreover, we proposed two other 
relations related to the conditional probability relation, the resemblance relation and 
the conditional transitivity relation, which are useful for the treatment of imprecise 
information. We also defined a notion of redundancy which generalizes the 
concept of redundancy in classical relational database. Finally, a process of 
clustering scalar domains was given by using not only the α-level to determine the 
degree of non-ideality, but also r as reference data for clustering. In our next 
paper, we will show that this process of clustering scalar domains is applicable to 
approximate data querying. 
Abstract. In this paper, we propose a new class of necessity measures 
which satisfy (R1) $N_A(B) > 0 \Leftrightarrow \exists \epsilon > 0; [A]_{1-\epsilon} \subseteq (B)_\epsilon$, 
(R2) $\exists h^* \in (0, 1); N_A(B) \ge h^* \Leftrightarrow A \subseteq B$ and 
(R3) $N_A(B) = 1 \Leftrightarrow (A)_0 \subseteq [B]_1$. 
It is shown that such a necessity measure is designed easily by the level cut 
conditioning approach. A simple example of such a necessity measure is 
given. The proposed necessity measure is applied to fuzzy rough sets based 
on certainty qualifications. It is demonstrated that the proposed necessity 
measure gives better upper and lower approximations of a fuzzy set than 
necessity measures defined by S-, R- and reciprocal R-implications. 



1 Introduction 

In [3], we showed that a necessity measure can be obtained by specifying a level 
cut condition. However, the usefulness of this result was not clearly demonstrated. 
In [4], we showed that fuzzy rough sets based on certainty qualifications 
give better approximations than previous fuzzy rough sets [1]. We have not yet 
discussed the selection of the necessity measure used to define fuzzy rough sets. 

In this paper, we demonstrate that, by the level cut conditioning approach, 
we can obtain a new and interesting class of necessity measures which satisfy 

(R1) $N_A(B) > 0$ if and only if there exists $\epsilon > 0$ such that $[A]_{1-\epsilon} \subseteq (B)_\epsilon$, 

(R2) there exists $h^* \in (0, 1)$ such that $N_A(B) \ge h^*$ if and only if $A \subseteq B$, 

(R3) $N_A(B) = 1$ if and only if $(A)_0 \subseteq [B]_1$, 

where $[A]_h = \{x \in X \mid \mu_A(x) \ge h\}$, $(A)_h = \{x \in X \mid \mu_A(x) > h\}$, 
$\mu_A : X \to [0, 1]$ is the membership function of a fuzzy set A and X is the 
universal set. Moreover, we demonstrate that the proposed necessity measures provide 
better lower and upper approximations in fuzzy rough sets based on certainty 
qualifications than the often used necessity measures. 

2 Necessity Measures and Level Cut Conditions 

Given a piece of information about an unknown variable x, namely that x is in a 
fuzzy set A, the certainty degree of the event that x is in a fuzzy set B is 
evaluated by 

$N_A(B) = \inf_{x \in X} I(\mu_A(x), \mu_B(x))$, (1) 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 261-268, 2001. 
where $I : [0, 1]^2 \to [0, 1]$ is an implication function, i.e., a function such that 

(I1) $I(0, 0) = I(0, 1) = I(1, 1) = 1$ and $I(1, 0) = 0$. 

$N_A(B)$ of (1) is called a necessity measure. Such necessity measures are used in 
approximate reasoning, information retrieval, fuzzy mathematical programming, 
fuzzy data analysis and so on. 

The quality of reasoning, decision and analysis in methodologies based on 
necessity measures depends on the adopted implication function. There are many 
implication functions. The implication which defines the necessity measure 
has been selected by proverbiality and by tractability. The selected implication 
function should fit the problem setting. The authors [3] proposed selecting the 
implication function by specifying two modifier functions with single 
parameters, $g^m$ and $g^M : [0, 1]^2 \to [0, 1]$, which roughly correspond to 
concentration and dilation modifiers such as the linguistic expressions ‘very’ and 
‘roughly’, respectively. In other words, from the viewpoint that a necessity 
measure relates to an inclusion relation, we suppose that $N_A(B) \ge h$ is equivalent 
to an inclusion relation between fuzzy sets A and B with a parameter h, i.e., 

$N_A(B) \ge h \Leftrightarrow m_h(A) \subseteq M_h(B)$, (2) 

where the inclusion relation between fuzzy sets is defined normally, i.e., 
$A \subseteq B$ if and only if $\mu_A(x) \le \mu_B(x)$ for all $x \in X$. $m_h(A)$ and 
$M_h(B)$ are fuzzy sets defined by $\mu_{m_h(A)}(x) = g^m(\mu_A(x), h)$ and 
$\mu_{M_h(B)}(x) = g^M(\mu_B(x), h)$. From their meaning and for technical reasons, 
$g^m$ and $g^M$ are required to satisfy 

(g1) $g^m(a, \cdot)$ is lower semi-continuous and $g^M(a, \cdot)$ upper semi-continuous for 
all $a \in [0, 1]$, 

(g2) $g^m(1, h) = g^M(1, h) = 1$ and $g^m(0, h) = g^M(0, h) = 0$ for all $h > 0$, 

(g3) $g^m(a, 0) = 0$ and $g^M(a, 0) = 1$ for all $a \in [0, 1]$, 

(g4) $h_1 \ge h_2$ implies $g^m(a, h_1) \ge g^m(a, h_2)$ and $g^M(a, h_1) \le g^M(a, h_2)$ for all 
$a \in [0, 1]$, 

(g5) $a \ge b$ implies $g^m(a, h) \ge g^m(b, h)$ and $g^M(a, h) \ge g^M(b, h)$ for all $h \le 1$, 

(g6) $g^m(a, 1) > 0$ and $g^M(a, 1) < 1$ for all $a \in (0, 1)$. 

Under the assumption that $g^m$ and $g^M$ satisfy (g1)-(g6), we proved that there 
exists a necessity measure which satisfies (2) and is defined by the following 
implication function (see [3]): 

$I^L(a, b) = \sup_{0 \le h \le 1} \{h \mid g^m(a, h) \le g^M(b, h)\}$. (3) 

There are many implication functions which can be represented by (3). 

Table 1 shows pairs $(g^m, g^M)$ with respect to S-implications, R-implications 
and reciprocal R-implications (see [2]) defined by continuous Archimedean 
t-norms and strong negations. A continuous Archimedean t-norm is a conjunction 
function which is represented by $t(a, b) = f^*(f(a) + f(b))$ with a continuous 
and strictly decreasing function $f : [0, 1] \to [0, +\infty)$ such that $f(1) = 0$ (see 
[2]), where $f^* : [0, +\infty) \to [0, 1]$ is a pseudo-inverse defined by 
$f^*(r) = \sup\{h \mid f(h) \ge r\}$. A strong negation is a bijective strictly decreasing 
function $n : [0, 1] \to [0, 1]$ such that $n(n(a)) = a$. Given a t-norm t and a strong 
negation n, the associated S-implication function $I^S$, R-implication function $I^R$ 
and reciprocal R-implication $I^{r-R}$ are defined by $I^S(a, b) = n(t(a, n(b)))$, 
$I^R(a, b) = \sup_{0 \le h \le 1}\{h \mid t(a, h) \le b\}$ and $I^{r-R}(a, b) = I^R(n(b), n(a))$. 

Table 1. $g^m$ and $g^M$ for $I^S$, $I^R$ and $I^{r-R}$ with continuous Archimedean t-norms 

I         f(0)    g^m(a,h), when h > 0                   g^M(a,h), when h > 0 
I^S       -       max(0, 1 - f(a)/f(n(h)))               min(1, f(n(a))/f(n(h))) 
I^R       < +∞    max(0, 1 - f(a)/(f(0) - f(h)))         min(1, (f(0) - f(a))/(f(0) - f(h))) 
I^R       = +∞    a                                      f^{-1}(max(0, f(a) - f(h))) 
I^{r-R}   < +∞    (f(n(a)) - f(h))/(f(0) - f(h))         min(1, f(n(a))/(f(0) - f(h))) 
I^{r-R}   = +∞    n(f^{-1}(max(0, f(n(a)) - f(h))))      a 
all       -       g^m(a,h) = 0, when h = 0               g^M(a,h) = 1, when h = 0 

3 A New Class of Necessity Measures 

Dienes, Gödel, reciprocal Gödel and Łukasiewicz implications are often used
to define necessity measures. S-implications, R-implications and reciprocal
R-implications are also considered, since they are more or less generalized
implication functions defined by t-norms and strong negations. The t-norms often
used are the minimum operation and continuous Archimedean t-norms.

None of those often used implication functions satisfies the three conditions
(R1)-(R3). Indeed, checking a sufficient condition to (R2), i.e., there exists
$h \in (0, 1)$ such that $g^m(\cdot, h) = g^M(\cdot, h)$, for R- and reciprocal R-implications,
the equality holds when $h = 1$ but never holds when $h \in (0, 1)$. For S-implications,
the equality never holds for $h \in [0, 1]$. Those facts can be confirmed from Table 1
in the case of continuous Archimedean t-norms.

We propose a new class of necessity measures which satisfy (R1)-(R3).
(R1)-(R3) can be rewritten by using $g^m$ and $g^M$ as shown in the following theorem.

Theorem 1. (R1)-(R3) can be rewritten as follows by using $g^m$ and $g^M$:

(R1') $\lim_{h \to +0} g^m(a, h) = 0$ and $\lim_{h \to +0} g^M(a, h) = 1$ for all $a \in (0, 1)$,

(R2') there exists $h^* \in (0, 1)$ such that $g^m(\cdot, h) = g^M(\cdot, h)$ and $g^m(\cdot, h)$ is
bijective if and only if $h = h^*$,

(R3') one of the following two assertions holds:

(i) $g^m(a, 1) = 1$ for all $a \in (0, 1)$ and $g^M(b, 1) < 1$ for all $b \in (0, 1)$,

(ii) $g^m(a, 1) > 0$ for all $a \in (0, 1)$ and $g^M(b, 1) = 0$ for all $b \in (0, 1)$.

Theorem 1 shows that there are many necessity measures which satisfy
(R1)-(R3). One such measure is given in the following example.

Fig. 1. Transition of $g^m$ and $g^M$ from $h = 0$ to $h = 1$

Example 1. The following $g^m$ and $g^M$ satisfy (g1)-(g6) as well as (R1')-(R3')
with $h^* = 0.5$:



g”^{x,h) 



g^(x,h) 



max ^0, , if h e [0, 0.5), 

min(l,^-^), if h e [0.5,1]. 
f min^l,^^, if h G [0,0.5), 

I max ^0, ^ 2h^ ') ’ ^ ^ 



(4) 

(5) 



where all denominators are treated as +0 when they become 0. Those $g^m$ and
$g^M$ are illustrated in Figure 1. From (3), the implication function associated
with this pair $(g^m, g^M)$ is obtained as

$$I^L(a, b) = \begin{cases} 1, & \text{if } a = 0 \text{ or } b = 1, \\ \dfrac{1 - a + b}{2}, & \text{otherwise.} \end{cases} \qquad (6)$$



The necessity measure is defined by using this implication function, which
satisfies (2) with $g^m$ and $g^M$ defined by (4) and (5).
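As a small illustration of the implication function (6), the sketch below evaluates it and uses it in a necessity measure on a finite universe. The form $N_A(B) = \min_x I(\mu_A(x), \mu_B(x))$ is the usual implication-based definition and is assumed here, since the defining equation lies outside this section; the function names are hypothetical.

```python
from typing import Callable, Sequence


def proposed_implication(a: float, b: float) -> float:
    """Implication (6): 1 if a = 0 or b = 1, otherwise (1 - a + b) / 2."""
    if a == 0.0 or b == 1.0:
        return 1.0
    return (1.0 - a + b) / 2.0


def necessity(
    mu_a: Sequence[float],
    mu_b: Sequence[float],
    implication: Callable[[float, float], float] = proposed_implication,
) -> float:
    """Assumed form N_A(B) = min_x I(mu_A(x), mu_B(x)) on a finite universe."""
    return min(implication(a, b) for a, b in zip(mu_a, mu_b))
```

Note that, away from the boundary cases a = 0 and b = 1, the values of (6) stay within [0, 1), which is what lets the measure discriminate between many pairs of fuzzy sets.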



4 Fuzzy Rough Sets Based on Certainty Qualifications 

Rough sets are known as a theory for dealing with uncertainty mainly caused by
indiscernibility between objects and have been applied to the reduction of
information tables, data mining and expert systems. As a generalization of rough sets,
fuzzy rough sets have been proposed in [1]. In fuzzy rough sets, the equivalence
relation is extended to a similarity relation, and lower and upper approximations
of fuzzy sets are defined by necessity and possibility measures, respectively. The
authors [4] have proposed fuzzy rough sets based on certainty qualifications. It is
shown that, by using certainty qualifications, we obtain better lower and upper
approximations than the previous ones. In the previous fuzzy rough sets, lower
and upper approximations cannot be obtained when a fuzzy partition is given
instead of a similarity relation. On the other hand, fuzzy rough sets based on
certainty qualifications give lower and upper approximations even in this case.

In this section, we introduce fuzzy rough sets based on certainty
qualifications proposed in [4]. A certainty qualification is a restriction of possible
candidates for a fuzzy set $A$ by $N_A(B) \ge q$, where $q$ and a fuzzy set $B$ are
given. In many cases, the family of fuzzy sets $A$ which satisfy $N_A(B) \ge q$ is
characterized by the greatest element in the sense of set-inclusion, i.e.,
$A_1 \subseteq A_2$ if and only if $\mu_{A_1}(x) \le \mu_{A_2}(x)$ for all $x \in X$. To ensure this, we assume

(12) $I(c, b) \le I(a, d)$, $0 \le a \le c \le 1$, $0 \le b \le d \le 1$,

(13) $I$ is upper semi-continuous.

Under the assumptions (12) and (13), we obtain the greatest element $\hat{A}$ as
$\mu_{\hat{A}}(x) = \sigma[I](q, \mu_B(x))$, where the implication function $I$ defines the necessity
measure and a functional $\sigma$ is defined by $\sigma[I](a, b) = \sup_{0 \le h \le 1}\{h \mid I(h, b) \ge a\}$.
All fuzzy sets $A$ such that $A \subseteq \hat{A}$ satisfy $N_A(B) \ge q$.

A converse certainty qualification is also conceivable. A converse certainty
qualification is a restriction of possible candidates for a fuzzy set $B$ by
$N_A(B) \ge q$, where $q$ and $A$ are given. The family of fuzzy sets $B$ which satisfy
$N_A(B) \ge q$ is characterized by the smallest element in the sense of set-inclusion.
Under the assumptions (12) and (13), the smallest element $\check{B}$ is obtained as
$\mu_{\check{B}}(x) = \zeta[I](\mu_A(x), q)$, where the implication function $I$ defines the necessity
measure and a functional $\zeta$ is defined by $\zeta[I](a, b) = \inf_{0 \le h \le 1}\{h \mid I(a, h) \ge b\}$.
All fuzzy sets $B$ such that $B \supseteq \check{B}$ satisfy $N_A(B) \ge q$.

Based on certainty and converse certainty qualifications, two kinds of lower
and upper approximations can be defined: one based on certainty qualifications
and the other based on converse certainty qualifications. As in [4], lower and
upper approximations based on converse certainty qualifications are adopted in
this paper, since they are simpler than the others.

Given a fuzzy binary relation $R$ which is reflexive, i.e., $\mu_R(x, x) = 1$ for all
$x \in X$, a fuzzy rough set of a fuzzy set $A$ is a pair $(R_\square(A), R^\Diamond(A))$ defined by

$$\mu_{R_\square(A)}(x) = \sup_{y} \zeta[I](\mu_{[y]_R}(x), N_{[y]_R}(A)), \quad \mu_{R^\Diamond(A)}(x) = n(\mu_{R_\square(A^c)}(x)), \qquad (7)$$

where $n : [0, 1] \to [0, 1]$ is a strong negation. The complement $A^c$ of $A$ is defined
by $\mu_{A^c}(x) = n(\mu_A(x))$. $[y]_R$ is the fuzzy equivalence class of $y$ with respect to $R$.
It has been shown that we always have $R_\square(A) \subseteq A \subseteq R^\Diamond(A)$, which ensures
that $R_\square(A)$ and $R^\Diamond(A)$ are lower and upper approximations of $A$ (see [4]).
Moreover, $R_\square(A)$ is a better lower approximation than the one previously defined
by a necessity measure and $R^\Diamond(A)$ is a better upper approximation than the one
previously defined by a possibility measure. Basic properties of rough sets are
preserved in fuzzy rough sets based on certainty qualifications under additional
assumptions on the implication function $I$ (see [4]).

An interesting result of fuzzy rough sets based on certainty qualifications is
the fact that we can define lower and upper approximations even when a fuzzy
partition $\Phi = \{F_1, F_2, \ldots, F_n\}$ is given instead of a reflexive fuzzy relation $R$. A
fuzzy partition $\Phi = \{F_1, F_2, \ldots, F_n\}$ is a family of fuzzy sets $F_i$ satisfying

(P1) $\inf_{x} \max_{i = 1, \ldots, n} \mu_{F_i}(x) > 0$,

(P2) $\sup_{x} \min(\mu_{F_i}(x), \mu_{F_j}(x)) < 1$, for all $i, j \in \{1, 2, \ldots, n\}$, $i \ne j$.

When such a fuzzy partition is given, lower and upper approximations
$\Phi_\square(A)$ and $\Phi^\Diamond(A)$ are defined as follows:

$$\mu_{\Phi_\square(A)}(x) = \max_{i = 1, \ldots, n} \zeta[I](\mu_{F_i}(x), N_{F_i}(A)), \quad \mu_{\Phi^\Diamond(A)}(x) = n(\mu_{\Phi_\square(A^c)}(x)). \qquad (8)$$
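The approximations under a fuzzy partition can be sketched on a finite universe as follows. This is only an illustration under stated assumptions: the functional in the lower approximation is taken to be inf{h | I(a, h) >= q} and is approximated on a grid, the necessity measure is assumed to be the minimum of the implication over the universe, the Łukasiewicz implication and the negation 1 - a are example choices, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

Implication = Callable[[float, float], float]


def lukasiewicz(a: float, b: float) -> float:
    return min(1.0, 1.0 - a + b)


def necessity(mu_f: Sequence[float], mu_a: Sequence[float], imp: Implication) -> float:
    """Assumed form N_F(A) = min_x I(mu_F(x), mu_A(x))."""
    return min(imp(f, a) for f, a in zip(mu_f, mu_a))


def zeta(imp: Implication, a: float, q: float, steps: int = 1000) -> float:
    """inf{h | I(a, h) >= q}, approximated on a grid over [0, 1]."""
    eps = 1e-12  # tolerance for floating-point round-off
    hits = [k / steps for k in range(steps + 1) if imp(a, k / steps) >= q - eps]
    return min(hits) if hits else 1.0  # fallback when no grid point qualifies


def lower_approximation(partition, mu_a, imp: Implication) -> List[float]:
    degrees = [necessity(mu_f, mu_a, imp) for mu_f in partition]
    return [
        max(zeta(imp, mu_f[x], q) for mu_f, q in zip(partition, degrees))
        for x in range(len(mu_a))
    ]


def upper_approximation(partition, mu_a, imp: Implication) -> List[float]:
    neg = lambda v: 1.0 - v  # assumed strong negation
    complement = [neg(v) for v in mu_a]
    return [neg(v) for v in lower_approximation(partition, complement, imp)]


# Hypothetical two-block partition on a three-element universe.
partition = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
mu_a = [1.0, 0.6, 0.2]
lower = lower_approximation(partition, mu_a, lukasiewicz)
upper = upper_approximation(partition, mu_a, lukasiewicz)
```

In this toy case the lower approximation stays below the membership values of the set and the upper approximation stays above them, in line with the inclusion property stated for the relation-based approximations.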



5 Necessity Measures and Fuzzy Rough Sets 

Let us discuss necessity measures for fuzzy rough sets defined by certainty
qualifications. In order to obtain $R_\square(A) \ne \emptyset$ or $\Phi_\square(A) \ne \emptyset$ (resp. $R^\Diamond(A) \ne X$
or $\Phi^\Diamond(A) \ne X$), $N_{[y]_R}(A)$ or $N_{F_i}(A)$ (resp. $N_{[y]_R}(A^c)$ or $N_{F_i}(A^c)$) should be
positive (resp. less than one). From this point of view, we discuss the conditions
for $N_A(B) > 0$ and $N_A(B) = 1$ so that we know the variety of pairs of fuzzy
sets $(A, B)$ such that $N_A(B) \in (0, 1)$.

First we have the following theorem. 

Theorem 2. Let a t-norm $t$ satisfy

$$t(b_1, b_2) > 0, \text{ for all } b_1, b_2 > 0. \qquad (9)$$

Then we have the following assertions:

(i) For necessity measures $N_A(B)$ defined by S-implications, we have

Na{B) > 0 there exists 5 > 0 such that [A]i_£ C (F)^. (10) 

(ii) For necessity measures $N_A(B)$ defined by R-implications, we have

Na{B) > 0 ^ there exists 5 > 0 such that . . 

{A)s^ C {B),r for all e' 6 (0,e]. ^ ^ 

(iii) For necessity measures $N_A(B)$ defined by reciprocal R-implications, we have

Na{B) > 0 ^ there exists 6 > 0 such that 

[A]i_e/ C [F]i_e/ for all 6 ' 6 (0,e]. ^ ^ 



Equation (9) holds when $t$ is a minimum operation, an arithmetic product or
a continuous Archimedean t-norm with $f(0) = +\infty$. Theorem 2 shows that the
condition for $N_A(B) > 0$ is quite restrictive when $N_A(B)$ is defined by R- and
reciprocal R-implications with t-norms satisfying (9). Thus, t-norms which violate
(9), such as a bounded product, should be adopted for necessity measures defined
by R- and reciprocal R-implications. For $N_A(B) = 1$, we have the following
theorem.

Theorem 3. The following assertions are valid:

(i) When (9) holds, for necessity measures $N_A(B)$ defined by S-implications,
we have $N_A(B) = 1 \Leftrightarrow (A)_0 \subseteq [B]_1$.

(ii) For necessity measures $N_A(B)$ defined by R- and reciprocal R-implications
with continuous t-norms, we have $N_A(B) = 1 \Leftrightarrow A \subseteq B$.

From Theorem 3, we know that the domain of $A$ for which $N_A(B) \in (0, 1)$ is large
when $N_A(B)$ is defined by S-implications with t-norms satisfying (9). On the other
hand, the necessity measures defined by R- and reciprocal R-implications with
continuous t-norms cannot evaluate the difference between two different inclusion
relations, e.g., $A \subseteq B$ and $(A)_0 \subseteq [B]_1$ (in both cases, we have $N_A(B) = 1$).
Thus, in those necessity measures, the information $N_A(B) = 1$ is less effective
for estimating $B$ under known $A$.

From the discussions above, among the often used necessity measures,
S-implications with t-norms satisfying (9) seem to be good for applications to
fuzzy rough sets based on certainty qualifications. However, those implications
have the following disadvantage.

Theorem 4. Consider necessity measures defined by S-implications with t-norms
satisfying (9). For fuzzy rough sets based on certainty qualifications under a fuzzy
partition $\Phi = \{F_1, F_2, \ldots, F_n\}$, we have

(i) $\Phi_\square(A)$ is normal only when there exists $F_i$ such that $N_{F_i}(A) = 1$, where a
fuzzy set $A$ is said to be normal if there exists $x \in X$ such that $\mu_A(x) = 1$.

(ii) There exists $x \in X$ such that $\mu_{\Phi^\Diamond(A)}(x) = 0$ only when there exists $F_i$ such
that $N_{F_i}(A^c) = 1$.

From Theorem 3(i), the necessary and sufficient condition for $N_{F_i}(A) = 1$
is $(F_i)_0 \subseteq [A]_1$, and this condition is quite restrictive, so that, in many cases,
$(F_i)_0 \subseteq [A]_1$ does not hold. Thus, Theorem 4 implies that, in many cases, $\Phi_\square(A)$
is not normal and that $\mu_{\Phi^\Diamond(A)}(x) > 0$ for all $x \in X$ holds when an S-implication
with a t-norm satisfying (9) is adopted to define the necessity measure.

Now let us consider necessity measures satisfying (R1)-(R3). From (R1) and
(R3), we know that the domain of $A$ for which $N_A(B) \in (0, 1)$ is large in the proposed
necessity measures. Moreover, we have the following theorem.

Theorem 5. Consider necessity measures satisfying (R1)-(R3). For fuzzy rough
sets based on certainty qualifications under a fuzzy partition $\Phi = \{F_1, F_2, \ldots, F_n\}$,
we have

(i) $\Phi_\square(A)$ is normal if there exists $F_i$ such that $N_{F_i}(A) \ge h^*$.

(ii) There exists $x \in X$ such that $\mu_{\Phi^\Diamond(A)}(x) = 0$ if there exists $F_i$ such that
$N_{F_i}(A^c) \ge h^*$.

From Theorem 5, the proposed necessity measures seem to be better than
the often used necessity measures in application to fuzzy rough sets based on
certainty qualifications. By the proposed measure, we may obtain better lower
and upper approximations of a fuzzy set. Let us examine this conjecture in the
following example.

Fig. 2. Lower and upper approximations of A by three implication functions.
Panels: (a) the fuzzy partition and the fuzzy set A; (b) Łukasiewicz implication;
(c) Reichenbach implication; (d) the proposed implication function.



Example 2. Let $\Phi = \{F_1, F_2, \ldots, F_5\}$ and $A$ be a fuzzy partition and a fuzzy
set given in Figure 2(a). We compare lower and upper approximations $\Phi_\square(A)$
and $\Phi^\Diamond(A)$ with necessity measures defined by three different implication
functions. We consider Łukasiewicz implication ($I(a, b) = \min(1 - a + b, 1)$),
Reichenbach implication ($I(a, b) = 1 - a + ab$) and the proposed implication
function defined by (6). Łukasiewicz implication is an R-implication, a
reciprocal R-implication and, at the same time, an S-implication defined by a
strong negation $n(a) = 1 - a$ and a t-norm, more concretely, a bounded product
$t(a, b) = \max(0, a + b - 1)$, which does not satisfy (9). On the other hand,
Reichenbach implication is an S-implication defined by a strong negation $n(a) = 1 - a$
and a t-norm, more concretely, an arithmetic product $t(a, b) = ab$, which satisfies
(9). The lower and upper approximations $\Phi_\square(A)$ and $\Phi^\Diamond(A)$ with respect to the
three implication functions are shown in Figure 2(b)-(d). From these figures,
the proposed implication function gives the best approximations among the three
implication functions.

Abstract. In this paper we present a new approach for approximating
concepts in the framework of formal concept analysis. We investigate two
different problems. First, given a set of features B (or a set of objects
A), we are interested in finding a formal concept that approximates B
(or A). Second, given a pair (A, B), where A is a set of objects
and B is a set of features, we are interested in finding a formal concept
that approximates (A, B). We develop algorithms for implementing the
approximation techniques presented. The techniques developed in this
paper use ideas from fuzzy sets. The approach we present is different
from and simpler than existing approaches which use rough sets.



1 Introduction 

Formal concept analysis (FCA) is a mathematical framework developed by Wille
and his colleagues at Darmstadt, Germany, that is useful for representation and
analysis of data [11]. A pair consisting of a set of objects and a set of features
common to these objects is called a concept. Using the framework of FCA, concepts
are structured in the form of a lattice called the concept lattice. The concept
lattice is a useful tool for knowledge representation and knowledge discovery [4].
Formal concept analysis has also been applied in the area of conceptual modeling
that deals with the acquisition, representation and organization of knowledge [6].
Several concept learning methods have been implemented in [1,4,5] using ideas
from formal concept analysis.

Not every pair of a set of objects and a set of features defines a concept
[11]. Furthermore, we might be faced with a situation where we have a set of
features (or a set of objects) and need to find the best concept that approximates
these features (or objects). For example, when a physician diagnoses a patient, he
finds a disease whose symptoms are the closest to the symptoms that the patient
has. In this case we can think of the symptoms as features and the diseases as
objects. Another example is in the area of information retrieval, where a user's
query can be understood as a set of features and the answer to the query can
be understood as the set of objects that possess these features. It is therefore
of fundamental importance to be able to find concept approximations regardless
of how little information is available.

DAAH04-96-1-0325, under DEPSCoR program of Advanced Research Projects
Agency, Department of Defense.

In this paper we present a general approach for approximating concepts that 
uses ideas from fuzzy set theory. We first show how a set of features (or objects) 
can be approximated by a concept. We then extend our approach for approxi- 
mating a pair of a set of objects and a set of features. Based on our approach, 
we present efficient algorithms for concept approximation. 

The notion of concept approximation was first introduced in [7, 8] and further 
investigated in [9,10]. All these approaches use rough sets as the underlying 
approximation model. In this paper, we use fuzzy sets as the approximation 
model. This approach is simpler and the approximation is presented in terms 
of a single formal concept as compared to two in terms of lower and upper 
approximations [7-10]. Moreover, the concept approximation algorithms that 
result from using a fuzzy set approach are simpler. 

The organization of this paper is as follows. In Section 2 we give an overview 
of FCA results that we need for this paper. In Section 3, we show how to ap- 
proximate a set of features or a set of objects. In Section 4, we show how to 
approximate a pair of a set of objects and a set of features. A numerical example 
explaining the approximation ideas is given in Section 5. Finally, a conclusion is 
drawn in Section 6. 

2 Background 

Relationships between objects and features in FCA are given in a context, which
is defined as a triple $(G, M, I)$, where $G$ and $M$ are sets of objects and features
(also called attributes), respectively, and $I \subseteq G \times M$. An example of a context
is given in Table 1, where an "X" is placed in the $i$th row and $j$th column to
indicate that the object at row $i$ possesses the feature at column $j$. If object
$g$ possesses feature $m$, then $(g, m) \in I$, which is also written as $gIm$. The set
of features common to a set of objects $A$ is denoted by $\beta(A)$ and defined as
$\{m \in M \mid gIm \ \forall g \in A\}$. Similarly, the set of objects possessing all the features
in a set $B \subseteq M$ is denoted by $\alpha(B)$ and given by $\{g \in G \mid gIm \ \forall m \in B\}$.
A formal concept (or simply a concept) in the context $(G, M, I)$ is defined as a
pair $(A, B)$ where $A \subseteq G$, $B \subseteq M$, $\beta(A) = B$ and $\alpha(B) = A$. $A$ is called the
extent of the concept and $B$ is called its intent. For example, the pair $(A, B)$
where $A = \{4, 5, 8, 9, 10\}$ and $B = \{c, d, f\}$ is a formal concept. On the other
hand, the pair $(A, B)$ where $A = \{2, 3, 4\}$ and $B = \{f, h\}$ is not a formal concept
because $\alpha(B) \ne A$. A pair $(A, B)$ where $A \subseteq G$ and $B \subseteq M$ which is not a formal
concept is called a non-definable concept [10]. The Fundamental Theorem of FCA
states that the set of all formal concepts on a given context with the ordering
$(A_1, B_1) \le (A_2, B_2)$ iff $A_1 \subseteq A_2$ is a complete lattice called the concept lattice of
the context [11]. The concept lattice of the context given in Table 1 is shown in
Figure 1, where concepts are labeled using reduced labeling [2]. The extent of a
concept $C$ in Figure 1 consists of the objects at $C$ and the objects at the concepts
that can be reached from $C$ going downward following descending paths towards
the bottom concept $C_1$. Similarly, the intent of $C$ consists of the features at $C$
and the features at the concepts that can be reached from $C$ going upwards
following ascending paths to the top concept $C_{23}$. The extent and intent of each
concept in Figure 1 are also given in Table 2.
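The two derivation operators and the test for a formal concept can be sketched in a few lines. The context below is a hypothetical toy example (it is not Table 1), stored as a mapping from each object to its feature set; the function names are likewise illustrative.

```python
from typing import Dict, Hashable, Set

Context = Dict[Hashable, Set[str]]  # object -> set of features it possesses


def beta(objects: Set[Hashable], context: Context) -> Set[str]:
    """Features common to every object in `objects`."""
    all_features = set().union(*context.values())
    return {m for m in all_features if all(m in context[g] for g in objects)}


def alpha(features: Set[str], context: Context) -> Set[Hashable]:
    """Objects possessing every feature in `features`."""
    return {g for g, ms in context.items() if features <= ms}


def is_formal_concept(a: Set[Hashable], b: Set[str], context: Context) -> bool:
    """(A, B) is a formal concept iff beta(A) = B and alpha(B) = A."""
    return beta(a, context) == b and alpha(b, context) == a


# Hypothetical toy context with three objects and three features.
toy = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"b"}}
```

Here the pair ({1, 2, 3}, {"b"}) is a formal concept, whereas ({1, 2}, {"b"}) is a non-definable concept because every object possesses feature "b".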



Table 1. Example of a context

3 Approximating a Set of Features or a Set of Objects

Since approximating a set of objects works analogously to approximating a set of
features, we only show how to approximate a set of features. Let $B \subseteq M$ be
a set of features. Our goal is to find a formal concept $C$ whose intent is as
similar to $B$ as possible. The concept $C$ is then said to approximate $B$. Define
a membership function $f_C(B)$ that gives a measure of how well $C$ approximates
$B$ as follows:¹

$$f_C(B) = \frac{1}{2}\left( \frac{|B \cap \mathrm{Intent}(C)|}{|B \cup \mathrm{Intent}(C)|} + \frac{|\alpha(B) \cap \mathrm{Extent}(C)|}{|\alpha(B) \cup \mathrm{Extent}(C)|} \right)$$

The range of $f_C(B)$ is the interval [0, 1]. $f_C(B) = 0$ when $B$ and $\alpha(B)$ are disjoint
from the intent and extent of $C$, respectively. $f_C(B) = 1$ when $B = \mathrm{Intent}(C)$
and therefore $\alpha(B) = \mathrm{Extent}(C)$. In general, the closer the value of $f_C(B)$ to 1,
the greater the similarity between $B$ and the intent of $C$. Conversely, the closer
the value of $f_C(B)$ to 0, the less the similarity between $B$ and the intent of $C$.

To approximate a set of features $B$, we find a formal concept $C$ such that the
value of $f_C(B)$ is the closest to 1. If we find more than one concept satisfying the
approximation criteria, we choose only one such concept. In this case, we say
that these concepts equally approximate $B$. In some applications, for example
medical diagnosis, we may need to present the user with all the concepts that
equally approximate $B$.

¹ One can, as well, think of $f_C(B)$ as a similarity measure between $B$ and $C$.

The pseudo-code for the algorithm for approximating a set of features is given
in Algorithm 1. The input to this algorithm is the set of all formal concepts on a
given context $(G, M, I)$, which we denote by $L$, and a set of features $B \subseteq M$.² $L_i$
denotes the $i$th concept in $L$. The output is a formal concept $C$ that approximates
$B$ and the value of $f_C(B)$. The idea of the algorithm is similar to that of finding
a maximal element in a set. Finding the value of $f_C(B)$ requires evaluating $\alpha(B)$,
which requires time equal to $O(|B||G|)$ [2].³ The value of $\alpha(B)$ is assigned to the
variable Obj outside the loop to improve the efficiency of Algorithm 1. The
running time complexity of Algorithm 1 is $O(|L| + |B||G|)$.

Algorithm 1. Approximate a set B of features

C ← L_1
Obj ← α(B)
// Assign f_C(B) to maxvalue
maxvalue ← Evaluate-Membership(Obj, B, C)
n ← |L|
for (i ← 2; i ≤ n; i++)
    if ( Evaluate-Membership(Obj, B, L_i) > maxvalue ) then
        C ← L_i
        maxvalue ← Evaluate-Membership(Obj, B, C)
    end if
end for
Answer ← C and maxvalue
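Algorithm 1 can be sketched in Python as a single linear scan. This is an illustration rather than the authors' implementation: the function names are hypothetical, two empty sets are treated as fully similar (an assumed convention), and the sample data are the concepts C20 and C23 from Table 2 together with the concept reported in Section 5, with α({f,h,i}) = {5,6,7} inferred from the reported similarity values.

```python
from typing import FrozenSet, Iterable, Tuple

Concept = Tuple[FrozenSet, FrozenSet]  # (extent, intent)


def jaccard(x: FrozenSet, y: FrozenSet) -> float:
    """|x & y| / |x | y|; two empty sets count as identical (assumed convention)."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def evaluate_membership(objs: FrozenSet, feats: FrozenSet, concept: Concept) -> float:
    """Average of extent similarity and intent similarity."""
    extent, intent = concept
    return (jaccard(objs, extent) + jaccard(feats, intent)) / 2


def approximate_features(concepts: Iterable[Concept], feats: FrozenSet, alpha_b: FrozenSet):
    """Return the first concept maximising the membership value, and that value."""
    best, best_value = None, -1.0
    for concept in concepts:
        value = evaluate_membership(alpha_b, feats, concept)
        if value > best_value:  # strict comparison keeps the first maximiser
            best, best_value = concept, value
    return best, best_value


concepts = [
    (frozenset({5, 6, 7}), frozenset("fhix")),             # concept reported in Section 5
    (frozenset({2, 3, 4, 5, 6, 7, 8}), frozenset("fh")),   # C20
    (frozenset(range(1, 11)), frozenset()),                # C23
]
best, value = approximate_features(concepts, frozenset("fhi"), frozenset({5, 6, 7}))
```

Because α(B) is computed once by the caller and passed in, the scan itself touches each concept exactly once, which mirrors the reason given in the text for hoisting Obj out of the loop.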

The function Evaluate-Membership takes as arguments a set of objects A,
a set of features B, and a formal concept C. It returns the degree of membership
or similarity between the set of features B and the concept C when called
with arguments α(B), B and C, as is done in Algorithm 1. Likewise,
Evaluate-Membership returns the degree of similarity between the set of objects A and
the concept C when called with arguments A, β(A) and C. It also returns the
degree of similarity between the pair (A, B) and the concept C when called with
arguments A, B and C.

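Algorithm 2. Evaluate-Membership(A, B, C)

Return $\dfrac{1}{2}\left( \dfrac{|A \cap \mathrm{Extent}(C)|}{|A \cup \mathrm{Extent}(C)|} + \dfrac{|B \cap \mathrm{Intent}(C)|}{|B \cup \mathrm{Intent}(C)|} \right)$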



² The most efficient algorithm for finding all formal concepts of a context is called the
Next algorithm [3]. This algorithm is also described in [2].

³ $|X|$ is the cardinality of the set $X$, i.e., the number of elements of $X$.

The code for the function Evaluate-Membership requires the evaluation of
set intersections and set unions, which can be implemented very efficiently, in
constant time, using bit vectors and bit operations.
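A bit-vector version of the membership evaluation might look as follows. It is a sketch under assumptions: sets are encoded as Python integers via a hypothetical index mapping, and the constant-time claim applies when each set fits in a machine word (Python integers are arbitrary precision, so the cost grows with the number of words).

```python
from typing import Dict, Hashable, Iterable


def to_mask(items: Iterable[Hashable], index: Dict[Hashable, int]) -> int:
    """Encode a set as an integer with bit index[item] set for each item."""
    mask = 0
    for item in items:
        mask |= 1 << index[item]
    return mask


def popcount(mask: int) -> int:
    """Number of set bits, i.e. the cardinality of the encoded set."""
    return bin(mask).count("1")


def jaccard_bits(x: int, y: int) -> float:
    union = x | y          # set union as bitwise OR
    common = x & y         # set intersection as bitwise AND
    return popcount(common) / popcount(union) if union else 1.0


def evaluate_membership_bits(a: int, b: int, extent: int, intent: int) -> float:
    return (jaccard_bits(a, extent) + jaccard_bits(b, intent)) / 2


# Hypothetical encodings: objects 1..10, and only the features used below.
obj_index = {g: g - 1 for g in range(1, 11)}
feat_index = {m: i for i, m in enumerate("efh")}

a = to_mask({4, 6, 9}, obj_index)
b = to_mask("efh", feat_index)
extent = to_mask({4, 5, 6, 8}, obj_index)  # C16 in Table 2
intent = to_mask("efh", feat_index)
score = evaluate_membership_bits(a, b, extent, intent)
```

The resulting score agrees with the value 0.7000 listed for C16 in the last column of Table 2.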

4 Approximating a Pair of a Set of Objects and a Set of 
Features 

Suppose that we are given a set of objects A and a set of features B. We call the
process of finding a concept C such that the extent of C is as similar to A as
possible and the intent of C is as similar to B as possible concept approximation. We
also say that the concept C approximates the pair (A, B). Define a membership
function $f_C(A, B)$ that indicates how well the formal concept C approximates
the pair (A, B) as follows:

$$f_C(A, B) = \frac{1}{2}\left( \frac{|A \cap \mathrm{Extent}(C)|}{|A \cup \mathrm{Extent}(C)|} + \frac{|B \cap \mathrm{Intent}(C)|}{|B \cup \mathrm{Intent}(C)|} \right)$$

The expression $|A \cap \mathrm{Extent}(C)| / |A \cup \mathrm{Extent}(C)|$ indicates how similar A is to
Extent(C), and the expression $|B \cap \mathrm{Intent}(C)| / |B \cup \mathrm{Intent}(C)|$ indicates how
similar B is to Intent(C). It is also easy to see that the range of $f_C(A, B)$ is
the interval [0, 1]. $f_C(A, B) = 0$ when C and (A, B) do not have any element in
common and $f_C(A, B) = 1$ when (A, B) is equal to the formal concept C. The
closer the value of $f_C(A, B)$ to 1, the greater the similarity between the pair
(A, B) and the formal concept C. Conversely, the closer the value of $f_C(A, B)$
to 0, the less the similarity between (A, B) and C.

Algorithm 3 gives the pseudo-code for approximating a pair (A, B). The input
to Algorithm 3 is the set L of all formal concepts on a given context (G, M, I),
a set of objects A and a set of features B. The output is a concept C that
approximates (A, B) and the value of $f_C(A, B)$, which is used as an indication
of how well C approximates (A, B). The idea of Algorithm 3 is similar to that
of Algorithm 1. The running time complexity of Algorithm 3 is O(|L|).

Algorithm 3. Approximate a pair (A, B) of a set of objects and a set of features

C ← L_1
// Assign f_C(A, B) to maxvalue
maxvalue ← Evaluate-Membership(A, B, C)
n ← |L|
for (i ← 2; i ≤ n; i++)
    if ( Evaluate-Membership(A, B, L_i) > maxvalue ) then
        C ← L_i
        maxvalue ← Evaluate-Membership(A, B, C)
    end if
end for
Answer ← C and maxvalue

5 Numerical Example

In this section we give a numerical example of the approximation ideas discussed 
in Sections 3 and 4. 

Consider the context (G, M, I) given in Table 1, which gives information about
10 objects and 12 features that the objects can have. This context has 23 formal
concepts, which were generated using the algorithm Next described in [2]. Table
2 gives details about executing Algorithm 1 on the set of features B = {f,h,i}
and about executing Algorithm 3 on the pair $(A_1, B_1)$ = ({4,6,9}, {e,f,h}), which
is a non-definable concept because $\alpha(B_1) \ne A_1$. The first column in each row is
a label of the concept under consideration. The second and third columns give
the extent and intent of the concept. The fourth column contains the value of
$f_C$({f,h,i}), which is the degree of similarity between {f,h,i} and C. Finally, the
fifth column gives the value of $f_C(A_1, B_1)$, the degree of similarity between the
pair $(A_1, B_1)$ and C.

Considering the values of $f_C$({f,h,i}) in the fourth column, Algorithm 1
returns ({5,6,7}, {f,h,i,x}) as the formal concept approximating the set of features
{f,h,i} with similarity value of 0.8750. Similarly, considering the values in the fifth
column, Algorithm 3 returns the formal concept ({4,6}, {e,f,h,l}) as a result of
approximating the non-definable concept ({4,6,9}, {e,f,h}) with similarity value
of 0.7083.

6 Conclusion 

This paper presents a new approach for approximating concepts in the
framework of formal concept analysis. This approach is based on a fuzzy set theoretic
background and is different from the previous approaches, which use ideas from
rough set theory. We gave algorithms for approximating a set B of features
and for approximating a pair (A, B) of a set of objects and a set of features.
The time complexity of the algorithm for approximating a set of features is
O(|L| + |B||G|), where L is the set of all formal concepts and G is the set of all
objects. The time complexity of our algorithm for approximating a pair (A, B)
is O(|L|).

Table 2. Results of running Algorithm 1 on {f,h,i} and Algorithm 3 on
({4,6,9},{e,f,h}).

Concept   Extent                    Intent         f_C({f,h,i})   f_C(A_1, B_1)
C11       {4,6,10}                  {e,f,l}        0.2000         0.5000
C12       {4,6,9,10}                {f,l}          0.2083         0.5000
C13       {4,5,8}                   {c,d,e,f,h}    0.2667         0.4000
C14       {4,5,8,10}                {c,d,e,f}      0.1667         0.2833
C15       {4,5,8,9,10}              {c,d,f}        0.1714         0.2667
C16       {4,5,6,8}                 {e,f,h}        0.4500         0.7000
C17       {4,5,6,8,10}              {e,f}          0.2917         0.5000
C18       {2,4,5,8}                 {d,f,h}        0.3333         0.3333
C19       {2,4,5,8,9,10}            {d,f}          0.1875         0.2679
C20       {2,3,4,5,6,7,8}           {f,h}          0.5476         0.4583
C21       {2,3,4,5,6,7,8,9,10}      {f}            0.3333         0.3333
C22       {1,2,4,5,8,9,10}          {d}            0.0556         0.1250
C23       {1,2,3,4,5,6,7,8,9,10}    {}             0.1500         0.1500

Abstract. In a previous paper we developed an axiomatic characterization
of approximation operators defined by the classical diamond
and box operators of modal logic. The present paper contains the analogous
results for approximation operators which are defined by using the
concepts of fuzzy rough sets.

Keywords. Rough sets, fuzzy rough sets, approximation operators.



1 Introduction and Fundamental Definitions 

In the present paper we shall use the well-known concepts of the "classical"
crisp and fuzzy set theory, respectively.

For definiteness we recall the following definitions.

Let $U$ be an arbitrary non-empty crisp set. The power set of $U$ and the twofold
Cartesian product of $U$ by itself are denoted by $\mathbb{P}U$ and $U \times U$, respectively.
Binary relations on $U$ are sets of the form $R \subseteq U \times U$. We denote $[x, y] \in R$ also
by $xRy$.

For an arbitrary binary relation $R$ on $U$ we define

Definition 1.
1. $R$ is said to be reflexive on $U$ $=_{def}$ $\forall x(x \in U \to xRx)$.
2. $R$ is said to be transitive on $U$ $=_{def}$ $\forall x \forall y \forall z(xRy \wedge yRz \to xRz)$.
3. $R$ is said to be symmetric on $U$ $=_{def}$ $\forall x \forall y(xRy \to yRx)$.
4. $R$ is said to be an equivalence relation on $U$ $=_{def}$ $R$ is reflexive, transitive,
and symmetric on $U$.
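For a finite universe, the four properties of Definition 1 can be checked directly. The following sketch represents a relation as a set of ordered pairs; the function names and sample relations are hypothetical.

```python
from typing import Hashable, Iterable, Set, Tuple

Relation = Set[Tuple[Hashable, Hashable]]


def is_reflexive(r: Relation, universe: Iterable[Hashable]) -> bool:
    return all((x, x) in r for x in universe)


def is_transitive(r: Relation) -> bool:
    # For every chain x R y and y R z, require x R z.
    return all((x, z) in r for (x, y) in r for (y2, z) in r if y == y2)


def is_symmetric(r: Relation) -> bool:
    return all((y, x) in r for (x, y) in r)


def is_equivalence(r: Relation, universe: Iterable[Hashable]) -> bool:
    return is_reflexive(r, universe) and is_transitive(r) and is_symmetric(r)


universe = {1, 2, 3}
eq_rel = {(1, 1), (2, 2), (3, 3), (1, 2), (2, 1)}  # classes {1, 2} and {3}
chain = {(1, 2), (2, 3)}                           # neither reflexive nor transitive
```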

For mappings $\Phi : \mathbb{P}U \to \mathbb{P}U$ we define

Definition 2.
1. $\Phi$ is said to be embedding on $\mathbb{P}U$
$=_{def}$ $\forall X(X \subseteq U \to X \subseteq \Phi(X))$.
2. $\Phi$ is said to be closed on $\mathbb{P}U$
$=_{def}$ $\forall X(X \subseteq U \to \Phi(\Phi(X)) \subseteq \Phi(X))$.
3. $\Phi$ is said to be monotone on $\mathbb{P}U$
$=_{def}$ $\forall X \forall Y(X \subseteq Y \subseteq U \to \Phi(X) \subseteq \Phi(Y))$.
4. $\Phi$ is said to be symmetric on $U$
$=_{def}$ $\forall x \forall y(x, y \in U \wedge y \in \Phi(\{x\}) \to x \in \Phi(\{y\}))$.
5. $\Phi$ is said to be a closure and a symmetric closure operator on $\mathbb{P}U$
$=_{def}$ $\Phi$ fulfils the items 1, 2, 3 and 1, 2, 3, 4, respectively.
6. $\Phi$ is said to be strongly compact on $\mathbb{P}U$
$=_{def}$ $\forall X \forall y(X \subseteq U \wedge y \in \Phi(X) \to \exists x_0(x_0 \in X \wedge y \in \Phi(\{x_0\})))$.

This research was supported by the Deutsche Forschungsgemeinschaft as part of the
Collaborative Research Center "Computational Intelligence (531)".

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 277-285, 2001.

Fuzzy sets $F$ on $U$ are mappings of the form $F : U \to \langle 0, 1 \rangle$, where $\langle 0, 1 \rangle$
denotes the set of all real numbers $r$ with $0 \le r \le 1$. We define $\mathbb{F}\mathbb{P}U =_{def}
\{F \mid F : U \to \langle 0, 1 \rangle\}$ and call $\mathbb{F}\mathbb{P}U$ the fuzzy power set of $U$. For arbitrary fuzzy
sets $F$ and $G$ on $U$ we put $F \sqsubseteq G =_{def} \forall x(x \in U \to F(x) \le G(x))$.

In section 3 of this paper we shall consider binary fuzzy relations
$S : U \times U \to \langle 0, 1 \rangle$ and (crisp-fuzzy) mappings $\Psi : \mathbb{P}U \to \mathbb{F}\mathbb{P}U$. To formulate
the results of section 3 we have to modify definitions 1 and 2 as follows.
For defining some kinds of fuzzy transitivity we fix two arbitrary real functions
$\tau, \pi : \langle 0, 1 \rangle \times \langle 0, 1 \rangle \to \langle 0, 1 \rangle$.

Definition 3.
1. $S$ is said to be fuzzy reflexive on $U$
$=_{def}$ $\forall x(x \in U \to S(x, x) = 1)$.
2. $S$ is said to be fuzzy standard transitive on $U$
$=_{def}$ $\forall x \forall y \forall z(x, y, z \in U \to \min(S(x, y), S(y, z)) \le S(x, z))$.
3. $S$ is said to be fuzzy conjunction-like $\tau$-transitive on $U$
$=_{def}$ $\forall x \forall y \forall z(x, y, z \in U \to \tau(S(x, y), S(y, z)) \le S(x, z))$.
4. $S$ is said to be fuzzy implication-like $\pi$-transitive on $U$
$=_{def}$ $\forall x \forall y \forall z(x, y, z \in U \to S(x, y) \le \pi(S(y, z), S(x, z)))$.
5. $S$ is said to be fuzzy symmetric on $U$
$=_{def}$ $\forall x \forall y(x, y \in U \to S(x, y) = S(y, x))$.
6. $S$ is said to be a fuzzy standard equivalence relation on $U$
$=_{def}$ $S$ fulfils the items 1, 2, and 5.

Remark 1. The items 3 and 4 lead to further kinds of fuzzy equivalence relations.
These cases will be considered in a following paper, in particular, with respect
to the investigations in section 3.

Now, we are going to modify definition 2. Assume $\Psi : \mathbb{P}U \to \mathbb{F}\mathbb{P}U$. For
expressing the closedness of $\Psi$ we need the iteration $\Psi(\Psi(X))$, where $X \subseteq U$ and
$z \in U$.

Definition 4. $\Psi(\Psi(X))(z) =_{def} \sup\{\min(\Psi(X)(y), \Psi(\{y\})(z)) \mid y \in U\}$.
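On a finite universe the supremum in Definition 4 becomes a maximum, so the iteration can be computed directly. In the sketch below a crisp-fuzzy mapping is modelled as a function from a frozenset to a dictionary of membership degrees; the helper `psi_from_relation`, which builds such a mapping from a fuzzy relation by taking the maximum over the elements of X, is a hypothetical example and not a construction taken from the text.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Sequence, Tuple

FuzzySet = Dict[Hashable, float]
CrispFuzzyMap = Callable[[FrozenSet], FuzzySet]


def iterate(psi: CrispFuzzyMap, x: FrozenSet, universe: Sequence[Hashable]) -> FuzzySet:
    """Psi(Psi(X))(z) = max over y in U of min(Psi(X)(y), Psi({y})(z))."""
    outer = psi(frozenset(x))
    return {
        z: max((min(outer[y], psi(frozenset({y}))[z]) for y in universe), default=0.0)
        for z in universe
    }


def psi_from_relation(
    s: Dict[Tuple[Hashable, Hashable], float], universe: Sequence[Hashable]
) -> CrispFuzzyMap:
    """Hypothetical example mapping: Psi(X)(y) = max over x in X of S(x, y)."""
    def psi(x: FrozenSet) -> FuzzySet:
        return {y: max((s[(u, y)] for u in x), default=0.0) for y in universe}
    return psi


universe = [0, 1, 2]
s = {(u, u): 1.0 for u in universe}
for (u, v), degree in {(0, 1): 0.8, (1, 2): 0.5, (0, 2): 0.3}.items():
    s[(u, v)] = s[(v, u)] = degree  # reflexive, symmetric fuzzy relation

psi = psi_from_relation(s, universe)
once = psi(frozenset({0}))
twice = iterate(psi, frozenset({0}), universe)
```

In this example the iterated degree at element 2 exceeds the direct degree, so this particular mapping is not fuzzy closed; the underlying relation fails the min-transitivity condition.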

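Definition 5.
1. $\Psi$ is said to be fuzzy embedding on $\mathbb{P}U$
$=_{def}$ $\forall X \forall y(X \subseteq U \wedge y \in X \to \Psi(X)(y) = 1)$.
2. $\Psi$ is said to be fuzzy closed on $\mathbb{P}U$
$=_{def}$ $\forall X(X \subseteq U \to \Psi(\Psi(X)) \sqsubseteq \Psi(X))$.
3. $\Psi$ is said to be fuzzy monotone on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$
$=_{def}$ $\forall X \forall Y(X \subseteq Y \subseteq U \to \Psi(X) \sqsubseteq \Psi(Y))$.
4. $\Psi$ is said to be fuzzy symmetric on $U$
$=_{def}$ $\forall x \forall y(x, y \in U \to \Psi(\{x\})(y) = \Psi(\{y\})(x))$.
5. $\Psi$ is said to be a fuzzy closure and a fuzzy symmetric closure operator
on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ $=_{def}$ $\Psi$ fulfils the items 1, 2, 3 and 1, 2, 3, 4, respectively.
6. $\Psi$ is said to be fuzzy strongly compact on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$
$=_{def}$ $\forall X \forall y(X \subseteq U \wedge y \in U \to \exists x_0(x_0 \in X \wedge \Psi(X)(y) \le \Psi(\{x_0\})(y)))$.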

2 An Axiomatic Characterization of Rough Set Based 
Approximation Operators 

For a better understanding of the definitions and results of section 3 we recall the
fundamental definitions and main results of [13] in the present section, where we
also correct some slight mistakes.

In the paper [13] we started our investigations by recalling the definitions
of the upper and the lower rough approximation $\langle R \rangle X$ and $[R]X$, respectively,
where $R$ is an equivalence relation on $U$ and $X \subseteq U$.

For lack of space we considered only the construct $\langle R \rangle X$. By using
concepts of modal logic the definition of $\langle R \rangle X$ inspires us to introduce the following
approximation operator OPER($R$), where $R$ is an arbitrary binary relation on $U$
and $X \subseteq U$.

Definition 6. OPER(R)(A) =def {y|y C U A G X A [x, y] G 

Obviously, we have OPER(R) : FU FU. For the following investiga- 
tions it will be very important to describe the images OPER(R) without using 
the mapping OPER. To this end we generate a binary relation REL(^) where 
<P:FU ^FU. 

Definition 7. REL($\Phi$) =def $\{[x,y] \mid x,y \in U \wedge y \in \Phi(\{x\})\}$.

Then without any assumption on $R \subseteq U \times U$ we obtain the following

Theorem 1. For every $R \subseteq U \times U$, REL(OPER(R)) = R.

This theorem means that R can be uniquely reconstructed from OPER(R). In
other words, OPER is an injection from $\{R \mid R \subseteq U \times U\}$ into $\{\Phi \mid \Phi : \mathbb{P}U \rightarrow \mathbb{P}U\}$.
Now we are going to describe the set $\{\mathrm{OPER}(R) \mid R \subseteq U \times U\}$.

Lemma 1. For every $R \subseteq U \times U$, OPER(R) is monotone and strongly compact
on $\mathbb{P}U$.

For formulating the following two lemmas and the following theorem we start
our constructions with a mapping $\Phi : \mathbb{P}U \rightarrow \mathbb{P}U$, in contrast to theorem 1.

Lemma 2. If $\Phi$ is monotone on $\mathbb{P}U$ then
for every $X \subseteq U$, OPER(REL($\Phi$))(X) $\subseteq \Phi(X)$.

Lemma 3. If $\Phi$ is strongly compact on $\mathbb{P}U$ then
for every $X \subseteq U$, $\Phi(X) \subseteq$ OPER(REL($\Phi$))(X).

Theorem 1, lemma 2, and lemma 3 imply the following final theorem where
$\Phi : \mathbb{P}U \rightarrow \mathbb{P}U$.

Theorem 2. 

1. If $\Phi$ is monotone and strongly compact on $\mathbb{P}U$ then for every $X \subseteq U$,
OPER(REL($\Phi$))(X) = $\Phi(X)$.

2. The mapping OPER is a bijection from the set of all binary relations on U
onto the set of all monotone and strongly compact operators on $\mathbb{P}U$.

3. The mapping REL is the inversion of the mapping OPER and vice versa. 
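
The correspondence between OPER and REL can be checked mechanically on a small finite universe. The following Python lines are only a sketch under that finiteness assumption; the function names `oper`, `rel` and `powerset` and the sample relation are illustrative and not taken from [13].

```python
from itertools import combinations

def powerset(universe):
    """All subsets of `universe`, each as a frozenset."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def oper(relation):
    """OPER(R): maps X to {y | there is x in X with [x, y] in R} (Definition 6)."""
    def phi(subset):
        return frozenset(y for (x, y) in relation if x in subset)
    return phi

def rel(phi, universe):
    """REL(Phi): {[x, y] | y in Phi({x})} (Definition 7)."""
    return {(x, y) for x in universe for y in phi(frozenset([x]))}

U = {1, 2, 3}
R = {(1, 1), (1, 2), (3, 2)}   # arbitrary binary relation, not an equivalence
phi = oper(R)

# Theorem 1: R is recovered from OPER(R).
assert rel(phi, U) == R

# Monotonicity part of Lemma 1: X subset of Y implies OPER(R)(X) subset of OPER(R)(Y).
subsets = powerset(U)
assert all(phi(X) <= phi(Y) for X in subsets for Y in subsets if X <= Y)
```

Representing operators as Python closures keeps the sketch close to the notation OPER(R)(X), at the cost of enumerating the power set when properties are checked.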

Furthermore, in [13] we investigated how properties of binary relations on U
are translated into properties of operators on $\mathbb{P}U$ by the mapping OPER and
vice versa by the mapping REL. Here we present the following slightly corrected
and complemented results, respectively.

Theorem 3. 1. R is reflexive on U iff OPER(R) is embedding on $\mathbb{P}U$.

2. If $\Phi$ is embedding on $\mathbb{P}U$ then REL($\Phi$) is reflexive on U.

3. If REL($\Phi$) is reflexive on U and $\Phi$ is monotone and strongly compact on $\mathbb{P}U$
then $\Phi$ is embedding on $\mathbb{P}U$.

Theorem 4. 1. R is transitive on U iff OPER(R) is closed on $\mathbb{P}U$.

2. If $\Phi$ is closed and monotone on $\mathbb{P}U$ then REL($\Phi$) is transitive on U.

3. If REL($\Phi$) is transitive on U and $\Phi$ is monotone and strongly compact on
$\mathbb{P}U$ then $\Phi$ is closed on $\mathbb{P}U$.

Theorem 5. 1. R is reflexive and transitive on U iff OPER(R) is a closure
operator on $\mathbb{P}U$.

2. If $\Phi$ is a closure operator on $\mathbb{P}U$ then REL($\Phi$) is reflexive and transitive
on U.

3. If REL($\Phi$) is reflexive and transitive on U and $\Phi$ is monotone and strongly
compact on $\mathbb{P}U$ then $\Phi$ is a closure operator on $\mathbb{P}U$.

Theorem 6. 1. R is an equivalence relation on U iff OPER(R) is a symmetric
closure operator on $\mathbb{P}U$.

2. If $\Phi$ is a symmetric closure operator on $\mathbb{P}U$ then REL($\Phi$) is an equivalence
relation on U.

3. If REL($\Phi$) is an equivalence relation on U and $\Phi$ is monotone and strongly
compact on $\mathbb{P}U$ then $\Phi$ is a symmetric closure operator on $\mathbb{P}U$.

3 An Axiomatic Characterization of Fuzzy Rough Set 
Based Approximation Operators 

Assume

$S : U \times U \rightarrow \langle 0,1\rangle$ and $X \subseteq U$.

The upper and lower S-fuzzy approximation $\langle S\rangle X$ and $[S]X$, respectively, of
the crisp set X are defined as follows where $y \in U$ (see also [1-9]).

Definition 8. 1. $(\langle S\rangle X)(y)$ =def $\sup\{\min(S(x,y), \chi_X(x)) \mid x \in U\}$,

2. $([S]X)(y)$ =def $\inf\{\max(1 - S(x,y), \chi_X(x)) \mid x \in U\}$.

In the literature the constructs $\langle S\rangle X$ and $[S]X$ are sometimes called Fuzzy Rough
Sets, in particular if S is a fuzzy standard equivalence relation on U.

Because the functions min and max are mutually dual we obtain for every
$S : U \times U \rightarrow \langle 0,1\rangle$, $X \subseteq U$, and $y \in U$ (writing $\overline{X}$ for the complement $U \setminus X$),

$(\langle S\rangle X)(y) = 1 - ([S]\overline{X})(y)$

and

$([S]X)(y) = 1 - (\langle S\rangle \overline{X})(y)$,

i.e. $\langle S\rangle X$ and $[S]X$ are also mutually dual.
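
Definition 8 and the duality just stated can be tried out numerically. The sketch below assumes a finite universe (so that sup and inf become max and min) and stores S as a Python dictionary; the function names and the sample relation are illustrative assumptions.

```python
def upper(S, X, U):
    """(<S>X)(y) = sup over x in U of min(S(x, y), chi_X(x))."""
    return {y: max(min(S[(x, y)], 1.0 if x in X else 0.0) for x in U) for y in U}

def lower(S, X, U):
    """([S]X)(y) = inf over x in U of max(1 - S(x, y), chi_X(x))."""
    return {y: min(max(1.0 - S[(x, y)], 1.0 if x in X else 0.0) for x in U) for y in U}

U = ["a", "b", "c"]
S = {(x, y): 0.0 for x in U for y in U}
S.update({("a", "a"): 1.0, ("b", "b"): 1.0, ("c", "c"): 1.0,
          ("a", "b"): 0.75, ("b", "a"): 0.75,
          ("b", "c"): 0.25, ("c", "b"): 0.25})

X = {"a"}
complement = set(U) - X

up = upper(S, X, U)
low_of_complement = lower(S, complement, U)

# Duality: (<S>X)(y) = 1 - ([S] complement of X)(y) for every y.
assert all(abs(up[y] - (1.0 - low_of_complement[y])) < 1e-12 for y in U)
```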

Assume

$\tau, \pi : \langle 0,1\rangle \times \langle 0,1\rangle \rightarrow \langle 0,1\rangle$.

Generalizing ideas presented in [10] we define

Definition 9. 1. $(\langle S,\tau\rangle X)(y)$ =def $\sup\{\tau(S(x,y), \chi_X(x)) \mid x \in U\}$,

2. $([S,\pi]X)(y)$ =def $\inf\{\pi(S(x,y), \chi_X(x)) \mid x \in U\}$.

In the papers [10, 11] we have stated some properties of $\langle S,\tau\rangle X$ and $[S,\pi]X$,
in particular, if $\tau$ is a t-norm and $\pi$ is a certain kind of implication. Furthermore,
we underlined that $\langle S,\tau\rangle X$ and $[S,\pi]X$ are mutually dual if $\tau$ and $\pi$ fulfil the
equation

$\forall r \forall s\,(r,s \in \langle 0,1\rangle \rightarrow \pi(r,s) = 1 - \tau(r, 1-s))$.

Because of lack of space, in the following we shall consider only the construct
$\langle S\rangle X$ and the fuzzy approximation operator which can be generated by $\langle S\rangle X$.
The remaining cases described above will be investigated in a following paper.

Definition 8 suggests introducing the following "Fuzzy-Rough-Like" approximation
operator FROPER as a mapping from $\mathbb{P}U$ into $\mathbb{F}\mathbb{P}U$.

Assume $S : U \times U \rightarrow \langle 0,1\rangle$, $X \subseteq U$, and $y \in U$.

Definition 10. FROPER(S)(X)(y) =def $(\langle S\rangle X)(y)$.

The following lemma simplifies the definition of FROPER.

Lemma 4. FROPER(S)(X)(y) = $\sup\{S(x,y) \mid x \in X\}$.

Proof. Apply the definition of the characteristic function $\chi_X$ of the crisp set
$X \subseteq U$ and the equation min(r, 0) = 0 for every $r \in \langle 0,1\rangle$. □

Obviously, we have FROPER(S) : $\mathbb{P}U \rightarrow \mathbb{F}\mathbb{P}U$. As in section 2 we ask the
question how to describe the images FROPER(S) without using the mapping
FROPER. To this end by an arbitrary mapping $\Psi : \mathbb{P}U \rightarrow \mathbb{F}\mathbb{P}U$ we generate the
binary fuzzy relation FREL($\Psi$) on U as follows where $x, y \in U$.

Definition 11. FREL($\Psi$)(x, y) =def $\Psi(\{x\})(y)$.

Then without any assumption we get the following theorem.

Theorem 7. 1. For every $S : U \times U \rightarrow \langle 0,1\rangle$, FREL(FROPER(S)) = S.

2. FROPER is an injection from the set of all binary fuzzy relations on U into
the set $\{\Psi \mid \Psi : \mathbb{P}U \rightarrow \mathbb{F}\mathbb{P}U\}$.

3. FREL is the inversion of FROPER with respect to
$\{\mathrm{FROPER}(S) \mid S : U \times U \rightarrow \langle 0,1\rangle\}$ and vice versa.
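
Item 1 of Theorem 7 can be illustrated on a finite universe. The following sketch implements FROPER through the simplified form of Lemma 4 and FREL through Definition 11; the Python names, the dictionary encoding of S, and the convention that the supremum over an empty set is 0 are assumptions made for the illustration.

```python
def froper(S):
    """FROPER(S)(X)(y) = sup{S(x, y) | x in X} (Lemma 4); 0 for empty X."""
    def psi(X, y):
        return max((S[(x, y)] for x in X), default=0.0)
    return psi

def frel(psi, U):
    """FREL(Psi)(x, y) = Psi({x})(y) (Definition 11)."""
    return {(x, y): psi(frozenset([x]), y) for x in U for y in U}

U = [0, 1, 2]
# Sample fuzzy relation: similarity decreasing with distance (values 1, 0.5, 0).
S = {(x, y): 1.0 - abs(x - y) / 2.0 for x in U for y in U}

# Theorem 7, item 1: FREL(FROPER(S)) = S.
assert frel(froper(S), U) == S
```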

This theorem means that S can be uniquely reconstructed from FROPER(S).
Now, we are going to describe the set $\{\mathrm{FROPER}(S) \mid S : U \times U \rightarrow \langle 0,1\rangle\}$.

Lemma 5. For every $S : U \times U \rightarrow \langle 0,1\rangle$, the mapping FROPER(S) is fuzzy
monotone on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$.

Proof. We have to show

(1) $\forall X \forall Y\,(X \subseteq Y \subseteq U \rightarrow \mathrm{FROPER}(S)(X) \sqsubseteq \mathrm{FROPER}(S)(Y))$.

From $X \subseteq Y \subseteq U$ we get

$\{S(x,y) \mid x \in X\} \subseteq \{S(x,y) \mid x \in Y\}$,

hence (1) holds. □

Now, we are going to prove that FROPER(S) is fuzzy strongly compact on
$[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$. To show this we additionally need the concept of submodality which
we have already introduced in [12], but here in the following slightly modified
form. Assume $S : U \times U \rightarrow \langle 0,1\rangle$.

Definition 12. S is said to be submodal with respect to its first argument
=def $\forall X \forall y\,(X \subseteq U \wedge y \in U \rightarrow \exists x_0\,(x_0 \in X \wedge \sup\{S(x,y) \mid x \in X\} = S(x_0,y)))$.

Lemma 6. If S is submodal with respect to its first argument then FROPER(S)
is fuzzy strongly compact on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$.

Proof. Assume

(2) $X \subseteq U$ and $y \in U$.

We have to prove

(3) $\exists x_0\,(x_0 \in X \wedge \mathrm{FROPER}(S)(X)(y) \le \mathrm{FROPER}(S)(\{x_0\})(y))$.

Because S is submodal with respect to its first argument, we have

(4) $\exists x_0\,(x_0 \in X \wedge \sup\{S(x,y) \mid x \in X\} = S(x_0,y))$,

hence by definition of FROPER

(5) $\mathrm{FROPER}(S)(X)(y) = S(x_0,y)$.

Furthermore, by definition of FROPER we get

(6) $\mathrm{FROPER}(S)(\{x_0\})(y) = \sup\{S(x,y) \mid x \in \{x_0\}\} = S(x_0,y)$,

hence (5) and (6) imply (3). □

By the following lemmas 7 and 8 we solve the problem how the set
$\{\mathrm{FROPER}(S) \mid S : U \times U \rightarrow \langle 0,1\rangle\}$ can be characterized without using the
mapping FROPER.

Lemma 7. If $\Psi$ is a fuzzy monotone mapping on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ then for every
$X \subseteq U$, FROPER(FREL($\Psi$))(X) $\sqsubseteq \Psi(X)$.

Proof. We have to prove that for every $y \in U$,

(7) $\mathrm{FROPER}(\mathrm{FREL}(\Psi))(X)(y) \le \Psi(X)(y)$.

By lemma 4 we have

(8) $\mathrm{FROPER}(\mathrm{FREL}(\Psi))(X)(y) = \sup\{\mathrm{FREL}(\Psi)(x,y) \mid x \in X\}$,

hence by definition of FREL($\Psi$)

(9) $\mathrm{FROPER}(\mathrm{FREL}(\Psi))(X)(y) = \sup\{\Psi(\{x\})(y) \mid x \in X\}$.

Obviously, it is sufficient to show

(10) $\Psi(\{x\})(y) \le \Psi(X)(y)$ if $x \in X$.

But (10) holds because $\{x\} \subseteq X$ and $\Psi$ is a fuzzy monotone mapping on
$[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$, hence lemma 7 holds. □

Lemma 8. If $\Psi$ is fuzzy strongly compact on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ then for every $X \subseteq U$,
$\Psi(X) \sqsubseteq$ FROPER(FREL($\Psi$))(X).

Proof. We have to prove that for every $y \in U$,

(11) $\Psi(X)(y) \le \mathrm{FROPER}(\mathrm{FREL}(\Psi))(X)(y)$.

Because $\Psi$ is fuzzy strongly compact on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ we get

(12) $\exists x_0\,(x_0 \in X \wedge \Psi(X)(y) \le \Psi(\{x_0\})(y))$.

To prove (11) it is sufficient to show

(13) $\Psi(\{x_0\})(y) \le \mathrm{FROPER}(\mathrm{FREL}(\Psi))(\{x_0\})(y)$

because the mapping FROPER(FREL($\Psi$)) is fuzzy monotone on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ (see
lemma 5).

Furthermore, by definition of FROPER we have

(14) $\mathrm{FROPER}(\mathrm{FREL}(\Psi))(\{x_0\})(y) = \sup\{\mathrm{FREL}(\Psi)(x,y) \mid x \in \{x_0\}\} = \mathrm{FREL}(\Psi)(x_0,y)$,

hence by definition of FREL,

(15) $\mathrm{FREL}(\Psi)(x_0,y) = \Psi(\{x_0\})(y)$,

hence (13) holds. □

Theorem 8. 1. If $\Psi$ is fuzzy monotone and fuzzy strongly compact on
$[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ then for every $X \subseteq U$, FROPER(FREL($\Psi$))(X) = $\Psi(X)$.

2. FROPER is a bijection from the set of all binary fuzzy relations on U onto
the set of all mappings $\Psi$ which are fuzzy monotone and fuzzy strongly compact
on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$.

3. FREL is the inversion of FROPER and vice versa.

Now, analogously to section 2 we investigate how special properties of binary
fuzzy relations S on U will be translated into properties of operators
$\Psi : \mathbb{P}U \rightarrow \mathbb{F}\mathbb{P}U$ by the mapping FROPER and vice versa by the mapping
FREL.

Remark 2. Because of lack of space we omit the proofs of all the following theorems.

Theorem 9. 1. S is fuzzy reflexive on U iff FROPER(S) is fuzzy embedding
on $\mathbb{P}U$.

2. If $\Psi$ is fuzzy embedding on $\mathbb{P}U$ then FREL($\Psi$) is fuzzy reflexive on U.

3. If FREL($\Psi$) is fuzzy reflexive on U and $\Psi$ is fuzzy strongly compact on
$[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ then $\Psi$ is fuzzy embedding on $\mathbb{P}U$.

Theorem 10. 1. S is fuzzy standard transitive on U iff FROPER(S) is fuzzy
closed on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$.

2. If $\Psi$ is fuzzy closed and fuzzy monotone on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ then FREL($\Psi$) is fuzzy
standard transitive on U.

3. If FREL($\Psi$) is fuzzy standard transitive on U and $\Psi$ is fuzzy monotone and
fuzzy strongly compact on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ then $\Psi$ is fuzzy closed on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$.

Theorem 11. 1. S is fuzzy reflexive and fuzzy standard transitive on U iff
FROPER(S) is a fuzzy closure operator on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$.

2. If $\Psi$ is a fuzzy closure operator on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ then FREL($\Psi$) is fuzzy reflexive
and fuzzy standard transitive on U.

3. If FREL($\Psi$) is fuzzy reflexive and fuzzy standard transitive on U
then $\Psi$ is a fuzzy closure operator on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$.

Theorem 12. 1. S is a fuzzy standard equivalence relation on U iff
FROPER(S) is a fuzzy symmetric closure operator on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$.

2. If $\Psi$ is a fuzzy symmetric closure operator on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ then FREL($\Psi$) is a
fuzzy standard equivalence relation on U.

3. If FREL($\Psi$) is a fuzzy standard equivalence relation on U and $\Psi$ is fuzzy
monotone and fuzzy strongly compact on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$ then $\Psi$ is a fuzzy symmetric
fuzzy closure operator on $[\mathbb{P}U, \mathbb{F}\mathbb{P}U]$.

Acknowledgements. The author wishes to thank Claus-Peter Alberts for his help

Abstract. This work introduces a generalization of the algorithm LEM3, an
incremental learning system of production rules from examples, based on the
Boolean Approximation Space introduced by Pawlak. The generalization is
based on the Stochastic Approximation Space introduced by Wong and
Ziarko. In this paper, stochastic limits in the precision of the upper and lower
approximations of a class are addressed. These allow the generation of certain
rules with a certainty level β (0.5 ≤ β ≤ 1). The modifications to LEM3
necessary in order to handle examples with missing attribute values are also
introduced.



1 Introduction 

The main characteristics of the LEM3 system [4] are:

- It is inductive, supervised, incremental and learns with full memory (using all
previous examples for later learning).

- Learned knowledge is expressed in classification rules formed by conjunctions of
attribute-value (a-v) pairs.

- The rule learning strategy follows a nonincremental learning program, LEM2, in
generating minimal rules. LEM2 resolves inconsistency using the Rough Sets
theory introduced by Pawlak [1,3]. Two examples are inconsistent if they have the
same values in all the condition attributes, but have different classes. When a class,
X, contains inconsistent examples, it means that X is not definable by the set of all
condition attributes. The basic idea is to replace X by its upper and lower
approximations generated by the set of all condition attributes. For sets of
inconsistent examples, two types of rules can be generated: possible rules learned
from the upper approximation of each class, and certain rules learned from the
lower approximation of each class. One feature that differentiates LEM3 from
LEM2 is the use of the Global Data Structure to capture knowledge learned from
previous examples. In LEM3, this structure is proposed to support the incremental
updating of upper and lower approximations, based on the Boolean Approximation
Space (BoolAS).

The main contribution of this paper is the generalization, based on the Stochastic
Approximation Space (StocAS), of the algorithm for incremental updating of upper
and lower approximations proposed in LEM3. The StocAS [2] allows the use of
stochastic limits in determining the upper and lower approximations. Thus
classification rules with possible or certain membership of a class, supported by a
certainty level β (0.5 ≤ β ≤ 1), can be generated. The use of the StocAS is proposed in
[4] as future work.



2 Global Data Structure (GDS) in Boolean Approximation Space 
(BoolAS) 

A BoolAS, A, is an ordered pair (U,R), where U is a non-empty set called universe
and R is an equivalence relation on U called indiscernibility relation. Each
subset X of U is characterized by a pair of sets, the upper approximation and
lower approximation of X in A, which are defined as $\overline{R}X = \{x \in U \mid [x]_R \cap X \neq \emptyset\}$ and
$\underline{R}X = \{x \in U \mid [x]_R \subseteq X\}$, where $[x]_R$ denotes the equivalence class of R containing x.
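
As a small illustration of these two definitions, the following Python sketch computes both approximations from the partition of U induced by R. The function name and the sample data are assumptions made for the example.

```python
def approximations(partition, X):
    """Return (lower, upper) approximations of X for a partition of U."""
    lower, upper = set(), set()
    for block in partition:
        if block <= X:      # [x]_R is contained in X
            lower |= block
        if block & X:       # [x]_R intersects X
            upper |= block
    return lower, upper

partition = [{1, 2}, {3}, {4, 5}]   # equivalence classes of R
X = {1, 2, 4}
lower, upper = approximations(partition, X)
assert lower == {1, 2}
assert upper == {1, 2, 4, 5}
```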

The GDS stores information learned from previous training examples such as:
consistency of a-v pairs, instances denoted by a-v pairs, a-v pairs relative to each
class, and an instance-count for each class (which can be used to calculate conditional
probabilities that determine the measurement of "goodness" of a-v pairs). The GDS
consists of three tables: the Block Table (BT), the Relevant a-v Pair Table (Ra-vT)
and the Lower and Upper Approximation Tables (L&UAT).

In BT examples represented by integers are stored. The table indices are a-v pairs 
and classes. A block is a set of examples indexed by an a-v pair. The BT is used to 
store consistencies of a-v pairs and examples presented to the learning algorithm. An 
a-v pair is said to be consistent if it is associated with only one class, otherwise, it is 
inconsistent. A block is consistent if all the examples in the block are of the same 
class. 

The Ra-vT contains a-v pairs that are relevant to each class. An a-v pair is
relevant to a class if it can describe at least one instance of the class. For each class, a
low-list and an up-list of relevant a-v pairs are stored in this table. An a-v pair that is
relevant and consistent with a class is stored in the low-list of the class. An a-v pair
that is relevant and inconsistent with a class is stored in the up-list of the class. The lists
are ordered by the "goodness" of relevant a-v pairs. The goodness of an a-v pair, p,
with respect to a class, X, is defined as the conditional probability of an instance, e,
being in class X, given that e is in the block denoted by p. The lists are used to
minimise the search space of the rule-generating procedure. The BT and Ra-vT tables are
updated with each example. For each a-v pair in a new example, there are three steps
in updating the tables: 1) Checking the consistency of the a-v pair; 2) Inserting the
example into the BT with the index (attribute, value, class) and marking the block as
consistent or inconsistent, based on the result of the first step; 3) Inserting the a-v pair
into the proper a-v pair list of the Ra-vT.
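
The interplay of blocks, consistency and goodness described above can be sketched in a few lines of Python. This is a hypothetical data layout chosen for illustration, not the actual LEM3 tables: `block_table`, `insert_example`, `is_consistent` and `goodness` are assumed names, and probabilities are estimated by relative frequencies.

```python
# Block table: (attribute, value) -> {class label -> set of example ids}.
block_table = {}

def insert_example(example_id, attributes, label):
    """Insert an example under the index (attribute, value, class)."""
    for pair in attributes.items():
        block_table.setdefault(pair, {}).setdefault(label, set()).add(example_id)

def is_consistent(pair):
    """BoolAS consistency: the a-v pair is associated with only one class."""
    return len(block_table.get(pair, {})) == 1

def goodness(pair, label):
    """Estimate of P(e in class `label` | e in the block denoted by `pair`)."""
    block = block_table.get(pair, {})
    total = sum(len(ids) for ids in block.values())
    return len(block.get(label, ())) / total if total else 0.0

insert_example(1, {"color": "red", "size": "big"}, "yes")
insert_example(2, {"color": "red", "size": "small"}, "no")
insert_example(3, {"color": "red", "size": "big"}, "yes")

assert is_consistent(("size", "big"))
assert not is_consistent(("color", "red"))
```

Keeping the class label as part of the index means that the goodness of a pair can be recomputed from set sizes whenever a new example arrives.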

The L&UAT contain three sets of examples for each class: the set of examples that
belong to the class, and two sets corresponding to the $\underline{R}X$ and $\overline{R}X$ of the class. Every
time a new example is presented to the learning system, it is added to the set
corresponding to its class. The $\underline{R}X$ and $\overline{R}X$ of all the classes are also updated with the
information provided by the new example. If the new example is consistent, it is only
necessary to update the $\underline{R}X$ and $\overline{R}X$ of the class associated with the new example. If,
however, the new example is inconsistent, it is necessary to update the $\underline{R}X$ and/or
$\overline{R}X$ of all the classes which this inconsistency influences.

The following problems found in [4] are addressed in this study: updating the 
goodness of an a-v pair, and managing unknown values of attributes. 



3 Generalization of LEM3 to the Stochastic Approximation Space 
(StocAS) 

A StocAS is a triplet A=(U,R,P), where U is a universe, R is an equivalence relation
on U, and P is a probability measure on subsets of U. The lower and upper
approximation of a subset X of U are defined by using the concept of stochastic
approximation with a certainty level β (0.5 ≤ β ≤ 1). The R-upper and R-lower
β-approximations of X in A are defined in [4] as: $\overline{R}X = \{x \in U \mid \beta > P(X|[x]) \ge 0.5\}$ and
$\underline{R}X = \{x \in U \mid P(X|[x]) \ge \beta\}$, where $P(X|[x])$ is the conditional probability defined as
$P(X \cap [x])/P([x])$.

In the $\overline{R}X$ definition we use the parameter α instead of the constant
0.5: $\overline{R}X = \{x \in U \mid P(X|[x]) \ge \alpha\}$, and we set α = (1-β) to maintain coherence with
the meaning of the certainty level β. In [4], Chan also considers an upper limit β in
the precision of the $\overline{R}X$: $\overline{R}X = \{x \in U \mid \beta > P(X|[x]) \ge 0.5\}$. In this work, this upper
limit is not used, so as to preserve coherence with the definition of $\overline{R}X$ according to
the Rough Sets theory. The region defined by the previous expression is, in fact, the
boundary region [3]. The generation of possible rules from the boundary region
examples, instead of from the $\overline{R}X$ examples, has the disadvantage that the rules
obtained only classify the boundary region examples, so that generalization is lost. On
the other hand it has the advantage that the rules generated do not contain rules that
have already been generated from the $\underline{R}X$, thereby eliminating the later process of
rule simplification.
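
The definitions used in this work (lower limit β, upper-approximation limit α = 1-β, no upper bound on the upper approximation) can be sketched as follows. The sketch assumes that P is the uniform measure over the examples, so that P(X|[x]) is a ratio of counts. The extra guard `p > 0` is an added assumption, not part of the definitions above: it keeps classes that do not meet X out of the upper approximation when α = 0, so that β = 1 reproduces the Boolean case.

```python
def stochastic_approximations(partition, X, beta):
    """Lower and upper beta-approximations of X with alpha = 1 - beta.

    Assumes a uniform probability over examples:
    P(X | [x]) = |X ∩ [x]| / |[x]|.
    """
    alpha = 1.0 - beta
    lower, upper = set(), set()
    for block in partition:
        p = len(block & X) / len(block)
        if p >= beta:
            lower |= block
        if p >= alpha and p > 0:   # `p > 0` is an assumption for the case alpha = 0
            upper |= block
    return lower, upper

partition = [{1, 2, 3, 4}, {5, 6}, {7, 8, 9, 10}, {11}]
X = {1, 2, 3, 5, 7}                # P(X|block) = 0.75, 0.5, 0.25, 0

lower, upper = stochastic_approximations(partition, X, beta=0.75)
assert lower == {1, 2, 3, 4}
assert upper == set(range(1, 11))
```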

In StocAS, an example, $e_i$, is consistent with a class, X, if the conditional
probability $P(X|[e_i]) \ge \beta$, where $[e_i]$ is the intersection of the blocks denoted by the a-v
pairs of the example. The case of β=1 is the consistency condition in the BoolAS. In
BoolAS, an example can be either consistent or inconsistent. In StocAS an example
may be consistent with one class, yet inconsistent with another; or it could be
inconsistent with several classes.

In StocAS, an a-v pair is consistent with a class, X, if the conditional probability
$P(X|[(a,v)]) \ge \beta$, where $[(a,v)]$ is the block denoted by the a-v pair. The Ra-vT
updating procedure in LEM3 was modified to include this new concept of
consistency of an a-v pair with a class, and to insert the relevant a-v pairs of a class,
X, in the low-list of that class. In BoolAS, an a-v pair is consistent if it is associated
with only one class. In StocAS, an a-v pair can begin by being consistent with a class,
X, stop being consistent with this class, and become consistent with a different class. It is
necessary to consider that in BoolAS the consistency concept of an a-v pair is used to
simplify the updating process of the $\underline{R}X$ and $\overline{R}X$ of a class. In BoolAS, the fact that an
example contains a pair consistent with its class is reason enough to affirm that the
example is also consistent. In StocAS this is not so easy; it is necessary to verify that
all the pairs of the example are consistent with the class. Finally, this modification of
the algorithm was discarded as it did not contribute any major simplification to the
approximations updating process, and added great complexity to the Ra-vT updating
process. The new updating procedure of $\underline{R}X$ and $\overline{R}X$ is the following.

Procedure UPDATE_STOCHASTIC_APPROXIMATIONS
/* Input: an example e_i of class X */
/* Output: updated lower and upper β-approximation of class X
   and of the rest of classes. */
begin
  Add e_i to examples set of class X;
  Y = Intersection of all blocks denoted by a-v pairs in e_i,
      including examples with an unknown pair;
  If P(X|Y) >= β
  then
    Add Y to lower β-approximation of X;
    Add Y to upper β-approximation of X;
    For each class X' ≠ X do
      If P(X'|Y) < β and P(X'|Y) >= α
      then
        Delete Y from lower β-approximation of X';
        Add Y to upper β-approximation of X';
      else if P(X'|Y) < α
      then
        Delete Y from lower β-approximation of X';
        Delete Y from upper β-approximation of X';
      endif
      endif
    endfor
  else /* P(X|Y) < β */
    Add Y to upper β-approximation of X;
    For each class X' ≠ X do
      If P(X'|Y) < β and P(X'|Y) >= α
      then
        Delete Y from lower β-approximation of X';
        Add Y to upper β-approximation of X';
      else if P(X'|Y) < α
      then
        Delete Y from lower β-approximation of X';
        Delete Y from upper β-approximation of X';
      endif
      endif
    endfor
  endif
End; /* Procedure UPDATE_STOCHASTIC_APPROXIMATIONS */
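
The procedure can be paraphrased in Python as follows. This is a minimal sketch, not the LEM3 implementation: the class `StochasticApproximations`, the `UNKNOWN` marker and the frequency estimate of P(X'|Y) are assumptions. The two branches of the printed procedure are folded into one loop over all classes, and, as a further assumption that the printed procedure does not state, Y is also added to both approximations of any other class X' with P(X'|Y) ≥ β, so that the new example joins the sets it belongs to.

```python
UNKNOWN = None  # marker for a missing attribute value (assumption)

class StochasticApproximations:
    def __init__(self, labels, beta):
        self.beta, self.alpha = beta, 1.0 - beta
        self.examples = {}                          # id -> attribute dict
        self.members = {c: set() for c in labels}   # examples of each class
        self.lower = {c: set() for c in labels}     # lower beta-approximations
        self.upper = {c: set() for c in labels}     # upper beta-approximations

    @staticmethod
    def _matches(a, b):
        """Examples agree on every attribute; an unknown value matches anything."""
        return all(a[k] == b[k] or UNKNOWN in (a[k], b[k]) for k in a)

    def update(self, example_id, attributes, label):
        self.examples[example_id] = attributes
        self.members[label].add(example_id)
        # Y: intersection of the blocks denoted by the a-v pairs of the example.
        Y = {i for i, attrs in self.examples.items()
             if self._matches(attributes, attrs)}
        for c in self.members:
            p = len(Y & self.members[c]) / len(Y)   # estimate of P(c | Y)
            if p >= self.beta:
                self.lower[c] |= Y
                self.upper[c] |= Y
            elif p >= self.alpha or c == label:
                self.lower[c] -= Y
                self.upper[c] |= Y
            else:                                   # p < alpha
                self.lower[c] -= Y
                self.upper[c] -= Y

gds = StochasticApproximations(["yes", "no"], beta=0.75)
for i in (1, 2, 3):
    gds.update(i, {"a": 1}, "yes")
gds.update(4, {"a": 1}, "no")   # P(yes|Y) = 0.75, P(no|Y) = 0.25
```

After these four updates the block {1, 2, 3, 4} is still in the lower approximation of "yes" at level 0.75 and has entered the upper approximation of "no"; a fifth "no" example with the same description would remove it from the lower approximation of "yes".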
4 Conclusions

A generalization of the algorithm LEM3 for learning production rules from consistent and
inconsistent examples, including the treatment of attributes with unknown values, has
been introduced. This extension, based on the Stochastic Approximation Space
introduced by Wong and Ziarko, is suggested in [4] as future work. The R-upper and
R-lower β-approximations of X are defined in [4] as: $\overline{R}X = \{x \in U \mid \beta > P(X|[x]) \ge 0.5\}$
and $\underline{R}X = \{x \in U \mid P(X|[x]) \ge \beta\}$. In the $\overline{R}X$ definition we use the parameter α instead of
the constant 0.5, and α = (1-β) in order to maintain coherence with the certainty level
β. In [4] an upper limit β in the precision of $\overline{R}X$ is also considered. This work has
been done without this upper limit β in order to preserve coherence with the
definition of $\overline{R}X$ according to the Rough Sets theory.

A value α>0 means the elimination of the inconsistent examples from some upper
approximations, in other words, the elimination of the rules that have the least conditional
probability. This can be considered as a vertical purification of the rule set. If
α increases, the total number of rules diminishes. A value of α=0.5 means the
inclusion of the inconsistent examples only in the $\overline{R}X$ of the class to which at least
half of the examples belong. If there is no class that fulfils this condition, the
examples are not included in any $\overline{R}X$. When parameter β diminishes, the inconsistent
examples whose consistency with a class is greater than β are considered in the $\underline{R}X$ of
that class. In other words, certain examples with a certainty level β are considered.
This makes the rules generated from $\underline{R}X$ more general because they include more
examples. Thus, the rules are compacted, and generally the number of rules
diminishes.

Abstract. Ever since Data Mining first appeared, a considerable number
of algorithms, methods and techniques have been developed. As a result
of research, most of these algorithms have proved to be more effective
and efficient. Different algorithms are often compared when solving
problems. However, algorithms that use different approaches are not very
often applied jointly to obtain better results. An approach based on
joining a predictive model (rough sets) with a link analysis
model (the Apriori algorithm) is presented in this paper.



Keywords: Data Mining models joining, Rough Sets, Association rules. 

1 Introduction 

The Rough Set methodology provides a way to generate decision rules. Some
condition values may be unnecessary in a decision rule. Thus it is always desirable
to reduce the amount of information required to describe a concept. A reduced
number of condition attributes results in a set of rules with higher support. In
addition, rules of this kind are easier to understand. The concept of reduct
is used when there is a need for reducing the number of attributes, but this is a
computationally expensive process.

One way to construct a simpler model computed from data, easier to understand
and with more predictive power, is to create a set of simplified rules [11]. A
simplified rule (also referred to as minimal rule or kernel rule) is one in which
the number of conditions in its antecedent is minimal. Thus, when dealing with
decision rules, some condition values can be unnecessary and can be dropped to
generate a simplified rule preserving essential information. In [7] an approach to
simplify decision tables is presented. Such an approach consists of three steps:
1) Computation of reducts of condition attributes; 2) Elimination of duplicate
rows; 3) Elimination of superfluous values of attributes. This approach to the
problem is not very useful because both the computation of reducts and of the
superfluous equivalence classes are NP-hard.

This work has been partially supported by UPM under project "Design of a Data
Warehouse to be integrated with a Data Mining system"

Many algorithms and methods have been proposed and developed to generate
minimal decision rules, some based on inductive learning [6], [8], [3] and
others based on Rough Sets theory [11], [10], [12], [2], [9]. Rough sets theory
provides a sound basis for the extraction of qualitative knowledge (dependencies)
from very large relational databases.

Shan [11] proposes and develops a systematic method for computing all minimal
rules, called maximally general rules, based on decision matrices.

Based on rough sets and boolean reasoning, Bazan [2] proposes a method to
generate decision rules using dynamic reducts, stable reducts of a given decision
table that appear frequently in random samples of a decision table.

Skowron [12] proposes a method, based on the notion of the relative
discernibility matrix, that makes it possible to obtain minimal decision rules
when applied to consistent decision tables.

An incremental learning algorithm for computing a set of all minimal decision
rules based on the decision matrix method is proposed in [10].

On the other hand, different algorithms have been proposed to calculate reducts
based on Rough Sets Theory. However, finding the minimal reduct is an NP-hard
problem [13], so its computational complexity makes application in large
databases impossible. In [4] a heuristic algorithm to calculate a reduct of the
decision table is proposed. The algorithm is based on two matrices that are
calculated using information from the Positive Region. In Chen and Lin [5] a
modified notion of reducts is introduced.

In this paper we propose to execute the Apriori algorithm prior to the Rough Set
methodology in order to discover strong dependencies that can, in general, be useful
to reduce the original set of attributes.

Observe that this approach will not generate a minimal reduct. Nevertheless it
is important to note that as a side effect it is possible to obtain strong rules to
classify the concept that will be refined using the rough set methodology.

The rest of the paper is organized as follows.

Section 2: introduction to Rough Sets and association rules.

Section 3: description of the new approach.

Section 4: discussion of results and future work.

2 Preliminaries 

2.1 Rough Sets Theory 

The original Rough Set model was proposed by Pawlak [7]. This model is concerned
with the analysis of deterministic data dependencies. According to Ziarko
[14] Rough Set Theory is the discovery, representation and analysis of data
regularities. In this model, the objects are classified into indiscernibility classes based
on pairs (attribute, values).

The following are the basic concepts of the rough set model.

Let OB be a non-empty set called the universe, and let IND be an equivalence
relation over the universe OB, called an indiscernibility relation, which represents
a classification of the universe into classes of objects which are indiscernible or
identical in terms of the knowledge provided by the given attributes. The main
notion in Rough Sets Theory is that of the approximation space, which is formally
defined as A = (OB, IND).

Equivalence classes of the relation are also called elementary sets. Any finite
union of elementary sets is referred to as a definable set. Let us take $X \subseteq OB$,
which represents a concept. It is not always the case that X can be defined
exactly as the union of some elementary sets. That is why two new sets are
defined: $\underline{Apr}(X) = \{o \in OB \mid [o] \subseteq X\}$ will be called the lower approximation and
$\overline{Apr}(X) = \{o \in OB \mid [o] \cap X \neq \emptyset\}$ will be called the upper approximation.
Any set defined in terms of its lower and upper approximations is called a rough
set.

2.2 Information Systems 

The main computational effort in the process of data analysis in rough set theory
is associated with the determination of attribute relationships in information
systems. An Information System is a quadruple S = (OB, AT, V, f) where:

— OB is a set of objects

— AT is a set of attributes

— $V = \bigcup V_a$, $V_a$ being the values of attribute a

— $f : OB \times AT \rightarrow V$

2.3 Decision Tables

Formally, a decision table S is a 5-tuple S = (OB, C, D, V, f). All the concepts
are defined similarly to those of information systems; the only difference is that
the set of attributes has been divided into two sets, C and D, which are conditions
and decision respectively.

Let B be a non-empty subset of $C \cup D$, and let x, y be members of OB. x, y
are indiscernible by B in S if and only if f(x, p) = f(y, p) for all $p \in B$. Thus
B defines a partition on OB. This partition is called a classification of OB
generated by B. Then for any subset B of $C \cup D$, we can define an approximation
space, and for any $X \subseteq OB$ the lower approximation of X in S and the upper
approximation of X in S will be denoted as $\underline{B}(X)$ and $\overline{B}(X)$, respectively.
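
The classification generated by a subset B of attributes, and the two approximations built on it, can be sketched in Python as follows. The table contents, the attribute names and the function `partition_by` are illustrative assumptions.

```python
from collections import defaultdict

def partition_by(table, attributes):
    """Group objects that agree on every attribute in `attributes`."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)
    return list(classes.values())

# Decision table: f(object, attribute) stored as nested dictionaries.
table = {
    "o1": {"headache": "yes", "temp": "high", "flu": "yes"},
    "o2": {"headache": "yes", "temp": "high", "flu": "no"},
    "o3": {"headache": "no", "temp": "normal", "flu": "no"},
}
B = ["headache", "temp"]                                   # condition attributes
X = {o for o, row in table.items() if row["flu"] == "yes"} # concept to approximate

blocks = partition_by(table, B)
lower = set().union(*(b for b in blocks if b <= X))
upper = set().union(*(b for b in blocks if b & X))

assert lower == set()           # o1 and o2 are indiscernible but differ on flu
assert upper == {"o1", "o2"}
```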

2.4 Association Rules 

The purpose of association discovery is to find items that imply the presence of 
other items. An association rule is formally described as follows: 

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals called items.

Let D be a set of transactions, each transaction $T \subseteq I$.

An association rule is an implication of the form $X \Rightarrow Y$ where $X \subset I$,
$Y \subset I$ and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set D with
confidence c if c% of transactions in D that contain X also contain Y. The rule
$X \Rightarrow Y$ holds in the transaction set D with support s if s% of transactions in
D contain $X \cup Y$.

294 M.C. Fernandez-Baizan et al.

Given a set of transactions D, the problem of mining association rules is to
generate all association rules that have support and confidence greater than the
user-specified minimum support (minsup) and minimum confidence (minconf).
In order to derive the association rules two steps are required: 1) Find the large
itemsets for a given minsup; 2) Compute rules for a given minconf based on the
itemsets obtained before.
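
Support and confidence as defined above can be computed directly from a list of transactions. The sketch below returns fractions rather than percentages; the function names and the sample transactions are assumptions made for the illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    """Fraction of the transactions containing X that also contain Y."""
    covered = [t for t in transactions if X <= t]
    return sum(Y <= t for t in covered) / len(covered) if covered else 0.0

D = [{"bread", "milk"}, {"bread", "butter"},
     {"bread", "milk", "butter"}, {"milk"}]
X, Y = {"bread"}, {"milk"}

assert support(D, X | Y) == 0.5     # 2 of the 4 transactions contain both
assert confidence(D, X, Y) == 2 / 3 # 2 of the 3 transactions with bread
```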

3 The Cooperative Algorithm 

Algorithm input: 

— T: the input decision data table. Note that the set of attributes AT will be
divided into condition attributes (potential antecedents) and the decision
attribute (the described attribute)

— minconf: the minimum confidence for rules

— n: the maximum number of conditions allowed in the antecedent of the rules

We will assume that the input table (T) is either discrete or has been
discretized in a pre-processing stage. We will also assume that the input table
has been binarized. Formally expressed:

Let $AT$ be a set of attributes and let $A \in AT$ be an attribute. Let
$V = \{a_1, a_2, \ldots, a_n\}$ be the set of values of $A$.

The binarization of $A$ will yield $n$ attributes $A_1, A_2, \ldots, A_n$ such that
for any $o \in OB$, if $f(o, A) = a_i$ then $f(o, A_j) = 1$ if $i = j$ and
$f(o, A_j) = 0$ if $i \neq j$.
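
The binarization step can be sketched as follows. This is a minimal illustration, assuming the table is a dictionary of rows; the function name `binarize` and the naming scheme `attr=value` for the new binary attributes are assumptions made for the example.

```python
def binarize(table, attr):
    """Replace attr by one 0/1 attribute per distinct value of attr."""
    values = sorted({row[attr] for row in table.values()})
    result = {}
    for obj, row in table.items():
        new_row = {k: v for k, v in row.items() if k != attr}
        for value in values:
            # Exactly one of the new attributes is 1 for each object.
            new_row[f"{attr}={value}"] = 1 if row[attr] == value else 0
        result[obj] = new_row
    return result


# Hypothetical discrete table with one three-valued attribute.
RAW = {
    "o1": {"colour": "red", "d": 1},
    "o2": {"colour": "blue", "d": 0},
    "o3": {"colour": "green", "d": 0},
}
BINARY = binarize(RAW, "colour")
```

After this transformation every attribute value becomes an item, so each row of the table can be read as a transaction for association rule mining.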
The process that will be performed is as follows:

1. Execute the Apriori algorithm with T as the input data table. Internally the
algorithm will calculate the best minimum support and k (size of large itemsets)
depending on the nature of T. We will call the resulting set of rules $T_{assoc}$.

2. Delete all those rules in $T_{assoc}$ in which the decision attribute occurs as
part of their antecedent.

3. If $\forall i\ \exists k$ such that $A_k \Rightarrow D_i$, then $A_k$ is a
superfluous attribute and can be removed from AT. Remove also such rules from
$T_{assoc}$.

4. Analyse the rules in the new $T_{assoc}$. This set contains two kinds of rules
that we have called strong classification rules and meta-rules. The former
comprises all those rules in $T_{assoc}$ whose consequent is some $D_i$ with
confidence > minconf. The latter is a set of rules that will allow us to reduce
the number of condition attributes.

— Include strong classification rules in the output data mining rule set.
— Those association rules in meta-rules that only contain condition
attributes have to be taken into account, as they highlight dependencies
among condition attributes. We will call this set of rules $T_{red}$.

5. Reduce AT taking into account the rules in $T_{red}$.

6. Execute the Positive Region algorithm to obtain a set of rules that will be
included in the output data mining rule set.
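
Steps 2 and 4 amount to filtering and splitting a list of mined rules. The sketch below shows one possible reading; the `Rule` record, the function `split_rules` and the item names are hypothetical, and the Apriori and Positive Region steps themselves are not reproduced.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: str
    confidence: float


def split_rules(rules, decision_items, minconf):
    """Return (strong classification rules, rules over condition attributes only)."""
    # Step 2: drop rules whose antecedent mentions a decision item.
    kept = [r for r in rules if not (r.antecedent & decision_items)]
    # Step 4: strong classification rules conclude a decision item confidently.
    strong = [r for r in kept
              if r.consequent in decision_items and r.confidence > minconf]
    # Remaining rules with a condition item as consequent relate conditions only.
    t_red = [r for r in kept if r.consequent not in decision_items]
    return strong, t_red


# Hypothetical mined rules over binarized items A1, A2 and decision item D1.
MINED = [
    Rule(frozenset({"A1"}), "D1", 0.9),
    Rule(frozenset({"A2"}), "D1", 0.4),
    Rule(frozenset({"D1"}), "A1", 0.8),
    Rule(frozenset({"A1"}), "A2", 0.7),
]
STRONG, T_RED = split_rules(MINED, frozenset({"D1"}), minconf=0.6)
```

In this toy run only the first rule is a strong classification rule, and only the last one describes a dependency among condition attributes.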
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4 Conclusions 

This approach provides a sound basis for the definition of a new cooperative al- 
gorithm to obtain comprehensible rules, while avoiding the computational com- 
plexity of classical predictive methods. 
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Abstract. In this paper, we present a particular study of the negative 
factors that affect the performance of university students. The analysis is 
carried out using the CAI (Conjuntos Aproximados con Incertidumbre)
model, which is a new revision of the VPRS (Variable Precision Rough
Set) model. The major contribution of the CAI model is the approximate
equality among knowledge bases. This concept, joined with the revision
of the process of knowledge reduction (concerning both attributes and
categories), allows a significant reduction in the number of generated rules
and the number of attributes per rule, as shown in the case study.



1 Introduction 

One of the first approaches for extracting knowledge from data, without statisti- 
cal background, was the Rough Sets Theory (RS), introduced by Z. Pawlak [1, 2]. 
It arose from the necessity of having a formal framework to manage imprecise 
knowledge originating from empirical data. One of the advantages of this model
is that it requires no preliminary or additional knowledge about the data
analysed.

The development of tools based on this approach and their application to real
problems have shown some limitations of the model, the most remarkable being
its inability to manage uncertain information. An extension of this model
was proposed by W. Ziarko [3, 4]. It is derived from the original model without
any additional hypotheses, incorporates uncertainty management, and is called
Variable Precision Rough Sets (VPRS).

The CAI (Conjuntos Aproximados con Incertidumbre) model [5, 6, 7] is derived
from the VPRS model. Like the VPRS model, the CAI model is suited to deal with
uncertain information, but with the aim of improving the classification power
in order to induce stronger rules. Approximate equality of different knowledge
bases is defined to reach this goal; it is an attempt to introduce uncertainty
at two different levels: the constituting blocks of knowledge (elementary
categories) and the overall knowledge.

The paper is organised as follows. Sections 2 and 3 briefly introduce basic 
concepts on RS and VPRS, respectively. Section 4 introduces the approximate 
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equality concept and the knowledge reduction process under the CAI model. In 
Section 5, a case study is presented to show the improvement of the CAI model,
in terms of a significant reduction in the number of generated rules and the
number of attributes per rule. Concretely, the application is the analysis
of negative factors that affect the performance of students in a Databases subject
at the University of Vigo. Finally, the conclusions of this work are presented.



2 Rough Sets Fundamentals 

The Rough Set framework allows the treatment of imprecise knowledge. Imprecision
arises from the fact that the granularity of knowledge, which influences the
definition of the reference universe, causes indiscernibility [1, 2, 8]. The
basic notions are the knowledge base and the lower and upper approximations.

Let $U$ be a finite, non-empty set representing a reference universe and let
$R$ be an equivalence relation with $R \subseteq U \times U$; then the pair
$K = (U, R)$ is called a knowledge base.

The knowledge base $K$ establishes a partition of the universe into disjoint
categories (i.e. a classification), denoted by $U/R$, whose elements $[x]_R$
are the equivalence classes of $R$. Objects belonging to the same class are
indistinguishable under the relation $R$, called the indiscernibility relation.
Since these classes represent elementary properties of the universe expressed by
means of knowledge $K$, they are also called elementary categories.

Objects belonging to the same category are not distinguishable, which means 
that their membership status with respect to an arbitrary subset of the domain 
may not always be clearly definable. This fact leads to the definition of a set 
in terms of lower and upper approximations. The lower approximation is a de- 
scription of the domain objects which are known with certainty to belong to the 
subset of interest, whereas the upper approximation is a description of the ob- 
jects which possibly belong to the subset. Any subset defined through its lower 
and upper approximations is called a rough set. 

Formally, given a knowledge base $K = (U, R)$ and $X \subseteq U$, the $R$-lower
approximation and $R$-upper approximation (denoted by $\underline{R}X$ and
$\overline{R}X$, respectively) are defined as follows:

$$\underline{R}X = \bigcup \{[x]_R \in U/R : [x]_R \subseteq X\}$$

$$\overline{R}X = \bigcup \{[x]_R \in U/R : [x]_R \cap X \neq \emptyset\}$$

Once these approximations of $X$ are defined, the reference universe $U$ is
divided into three different regions: the positive region $POS_R(X)$, the
negative region $NEG_R(X)$ and the boundary region $BND_R(X)$, defined as follows:

$$POS_R(X) = \underline{R}X$$
$$NEG_R(X) = U - \overline{R}X$$
$$BND_R(X) = \overline{R}X - \underline{R}X$$
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The positive region or lower approximation of X includes those objects that 
can be unmistakably classified as members of $X$ using knowledge $R$. Similarly,
the negative region includes those objects that certainly belong to $-X$ (the
complement of $X$). Finally, the boundary region is the indiscernible area of
the universe $U$.
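
The three regions can be computed directly from a partition of the universe. The sketch below is only an illustration: the partition is passed in as a list of sets, and the names `approximations` and `regions` are chosen here for the example.

```python
def approximations(partition, x):
    """Return the (lower, upper) approximations of x for a partition of the universe."""
    x = set(x)
    lower, upper = set(), set()
    for block in partition:
        if block <= x:   # block entirely inside X
            lower |= block
        if block & x:    # block overlaps X
            upper |= block
    return lower, upper


def regions(universe, partition, x):
    """Positive, negative and boundary regions of x."""
    lower, upper = approximations(partition, x)
    return {"POS": lower, "NEG": set(universe) - upper, "BND": upper - lower}


# Hypothetical universe with three elementary categories.
U = {1, 2, 3, 4, 5, 6}
U_R = [{1, 2}, {3, 4}, {5, 6}]
REGIONS = regions(U, U_R, {1, 2, 3})
```

In this example the block {3, 4} only partly overlaps the target set, so it forms the boundary region, while {5, 6} is the negative region.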



3 Variable Precision Rough Sets 



The basic concept introduced in the VPRS model is the relationship of majority
inclusion. Its definition relies on a criterion $c(X, Y)$ called the relative
classification error, defined as follows:

$$c(X, Y) = 1 - \frac{card(X \cap Y)}{card(X)} \qquad (1)$$

where $X$ and $Y$ are two subsets of the universe $U$.

According to this, the traditional inclusion relationship between $X$ and $Y$ is
defined in the following way:

$$X \subseteq Y \ \text{iff}\ c(X, Y) = 0 \qquad (2)$$

Rough inclusion arises by relaxing the traditional inclusion when an admissible
error level is permitted in classification. This error is explicitly expressed
as $\beta$. Then, the rough inclusion relationship between $X$ and $Y$ is
defined as:

$$X \overset{\beta}{\subseteq} Y \ \text{iff}\ c(X, Y) \leq \beta \qquad (3)$$

Ziarko established as a requirement that at least 50% of the elements have to be
common elements, so that $0 \leq \beta < 0.5$.

Under this assumption, Ziarko redefines the concepts of lower and upper
approximation. Given a knowledge base $K = (U, R)$ and a subset $X \subseteq U$,
the $\beta$-lower approximation of $X$ (denoted by $\underline{R}_\beta X$) and the
$\beta$-upper approximation (denoted by $\overline{R}_\beta X$) are defined as follows:

$$\underline{R}_\beta X = \bigcup \{[x]_R \in U/R : [x]_R \overset{\beta}{\subseteq} X\}$$

$$\overline{R}_\beta X = \bigcup \{[x]_R \in U/R : c([x]_R, X) < 1 - \beta\}$$

As in Pawlak's model, the reference universe $U$ can be divided into three
different regions. These are the positive region $POS_{R,\beta}(X)$, the negative
region $NEG_{R,\beta}(X)$ and the boundary region $BND_{R,\beta}(X)$, defined as follows:

$$POS_{R,\beta}(X) = \underline{R}_\beta X$$
$$NEG_{R,\beta}(X) = U - \overline{R}_\beta X$$
$$BND_{R,\beta}(X) = \overline{R}_\beta X - \underline{R}_\beta X$$

These new definitions reduce the boundary region, allowing more objects to be
classified as members of $X$. The relationship between the VPRS model and the
RS model is that the RS model is the particular case of the VPRS model with
$\beta = 0$.
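
The effect of the admissible error on the approximations can be checked numerically. The sketch below is an illustration only; it assumes non-empty blocks, reads rough inclusion as a classification error of at most β, and uses hypothetical names and data.

```python
def classification_error(x, y):
    """Relative classification error c(X, Y) for a non-empty set x."""
    x, y = set(x), set(y)
    return 1 - len(x & y) / len(x)


def beta_approximations(partition, x, beta):
    """Return the (beta-lower, beta-upper) approximations of x."""
    if not 0 <= beta < 0.5:
        raise ValueError("beta must satisfy 0 <= beta < 0.5")
    lower, upper = set(), set()
    for block in partition:
        error = classification_error(block, x)
        if error <= beta:      # block is roughly included in X
            lower |= block
        if error < 1 - beta:   # block is not roughly included in the complement
            upper |= block
    return lower, upper


# Hypothetical partition: the blocks have errors 0.25, 0.5 and 0.75 w.r.t. X.
BLOCKS = [{1, 2, 3, 4}, {5, 6}, {7, 8, 9, 10}]
X = {1, 2, 3, 5, 7}
PAWLAK = beta_approximations(BLOCKS, X, beta=0.0)
VPRS = beta_approximations(BLOCKS, X, beta=0.25)
```

With β = 0 the lower approximation is empty and the boundary is the whole universe; with β = 0.25 the first block enters the positive region, the last block leaves the upper approximation, and the boundary shrinks to {5, 6}.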
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4 The CAI Model 

As mentioned above, the major contribution of the CAI model is the approximate 
equality and its use in the definition of knowledge equivalence [5, 6, 7]. 

In RS theory two different knowledge bases $K = (U, P)$ and $K' = (U, Q)$
are equivalent, denoted by $K \simeq K'$, if $U/P = U/Q$, where $U/P$ and $U/Q$
are the classifications induced by knowledge $P$ and $Q$, respectively. So, $P$
and $Q$ are equivalent if their constituting blocks are identical.

In the CAI model, uncertainty is introduced at two different levels: the
constituting blocks of knowledge (elementary categories) and the overall
knowledge, through the relationship of majority inclusion. Thus, two different
knowledge bases $P$ and $Q$ are equivalent or approximately equal, denoted by
$P \approx_\beta Q$, if the majority of their constituting blocks are similar.

Approximate equality makes it necessary to redefine the concepts involved in
knowledge reduction. As in Rough Set theory, the reduct and the core are the
basic concepts of the process of knowledge reduction.

In order to introduce these concepts it is necessary to define the
indiscernibility relation $IND(\mathbf{R})$. If $\mathbf{R}$ is a family of
equivalence relations, then $\bigcap \mathbf{R}$ is also an equivalence relation,
denoted by $IND(\mathbf{R})$.

Formally, the concept of approximate equality is defined as follows. Given two
different knowledge bases $K = (U, P)$ and $K' = (U, Q)$, we say that knowledge
$P$ is a $\beta$-specialization of knowledge $Q$, denoted by $P \subseteq_\beta Q$, iff

$$1 - \frac{card(\{[x]_P \in IND(P) \mid \exists [x]_Q \in IND(Q) : [x]_P \overset{\beta}{\subseteq} [x]_Q\})}{card(IND(P))} \leq \beta \qquad (4)$$

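Finally, we say that the two knowledge bases are approximately equal if the
following property holds:

$$P \approx_\beta Q \iff P \subseteq_\beta Q \wedge Q \subseteq_\beta P \qquad (5)$$

Let $\mathbf{R}$ be a family of equivalence relations and let $R \in \mathbf{R}$.
We will say that $R$ is $\beta$-dispensable in $\mathbf{R}$ if
$IND(\mathbf{R}) \approx_\beta IND(\mathbf{R} - \{R\})$; otherwise $R$ is
$\beta$-indispensable in $\mathbf{R}$. The family $\mathbf{R}$ is $\beta$-independent
if each $R \in \mathbf{R}$ is $\beta$-indispensable in $\mathbf{R}$; otherwise
$\mathbf{R}$ is $\beta$-dependent.

Proposition 1. If $\mathbf{R}$ is $\beta$-independent and $\mathbf{P} \subseteq \mathbf{R}$,
then $\mathbf{P}$ is also $\beta$-independent.

$\mathbf{Q} \subseteq \mathbf{P}$ is a $\beta$-reduct of $\mathbf{P}$ if $\mathbf{Q}$ is
$\beta$-independent and $IND(\mathbf{Q}) \approx_\beta IND(\mathbf{P})$. The set of all
$\beta$-indispensable relations in $\mathbf{P}$ will be called the $\beta$-core of
$\mathbf{P}$, and will be denoted by $CORE_\beta(\mathbf{P})$.

Proposition 2. $CORE_\beta(\mathbf{P}) = \bigcap RED_\beta(\mathbf{P})$, where
$RED_\beta(\mathbf{P})$ is the family of all $\beta$-reducts of $\mathbf{P}$.

The $\beta$-indispensable relations, $\beta$-reducts, and $\beta$-core can be
similarly defined relative to a specific indiscernibility relation $Q$, or
elementary category. For example, the relative $\beta$-core is the union of all
$\beta$-indispensable relations with respect to the relation $Q$.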
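
A rough sketch of approximate equality between two classifications is given below. It rests on an assumed reading of the definitions: a block is roughly included in another when its relative classification error is at most β, and P is a β-specialization of Q when the fraction of blocks of P that are not roughly included in any block of Q is at most β. All names and data are hypothetical.

```python
def roughly_included(x, y, beta):
    """Rough inclusion of block x in block y with admissible error beta."""
    return 1 - len(x & y) / len(x) <= beta


def is_beta_specialization(p, q, beta):
    """Assumed reading: most blocks of p are roughly included in some block of q."""
    matched = sum(1 for bp in p if any(roughly_included(bp, bq, beta) for bq in q))
    return 1 - matched / len(p) <= beta


def approximately_equal(p, q, beta):
    """Both classifications are beta-specializations of each other."""
    return is_beta_specialization(p, q, beta) and is_beta_specialization(q, p, beta)


# Hypothetical classifications of the universe {1, ..., 8} that differ in object 4.
P = [{1, 2, 3, 4}, {5, 6, 7, 8}]
Q = [{1, 2, 3}, {4, 5, 6, 7, 8}]

EQUAL_STRICT = approximately_equal(P, Q, beta=0.0)
EQUAL_RELAXED = approximately_equal(P, Q, beta=0.25)
```

The two classifications are not equivalent in the classical sense, but they become approximately equal once an error of 0.25 is admitted, which is the kind of relaxation that lets more attributes be treated as dispensable.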
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An important kind of knowledge representation system is the decision table 
formalism, which is broadly used in many applications. A decision table specifies 
what decisions (actions) should be undertaken when some conditions are satis- 
fied. Most decision problems can be formulated employing decision tables and 
therefore, this tool is particularly useful in decision making applications. 

The concepts defined in the CAI model allow us to approach the reduction of an
information system (decision table) in a way similar to Pawlak's methodology
[2], except for the treatment of inconsistencies. In the CAI model the
inconsistencies are necessary because they provide useful information for the
generation of rules. Therefore, we follow two steps [6]:

- Computation of $\beta$-reducts of condition attributes, which is equivalent to
eliminating some columns from the decision table.

- Computation of $\beta$-reducts of categories, which is equivalent to eliminating
superfluous values of attributes.

In this methodology neither duplicate rows nor inconsistencies are eliminated,
because this information is useful for rule generation. Consequently, due to
the treatment of inconsistencies, the induced decision rules may have an
associated error, which is in any case no greater than the admissible error $\beta$.

5 Case Study

This study deals with the analysis of negative factors that affect the students'
performance in a Databases subject. The data were obtained from Computer
Science students of the University of Vigo [6, 9]. The purpose of this paper is to
show the classification power of the CAI model.

We analyse 16 condition attributes (statistic and personal data) and 1 deci- 
sion attribute (student’s mark) relating to 118 students. If the decision attribute 
value is 0, the student fails. 

Firstly, we apply the method of Sect. 4, searching for the reducts of the
condition attributes and removing the superfluous attribute values for each
reduct. The result of this process is the generation of decision rules.

We need to select the $\beta$ value for generating decision rules. This is an
empirical process. In order to select the rules we use the following criteria:

- Rules that cover more objects.

- Rules with fewer condition attributes.

Under these criteria the results obtained, considering $\beta = 0.25$, are:

IF (Family Environment (Rural) AND Access Mode (FPII))

THEN Failed [The rule covers 15 objects with an error of 0.20].

IF (Access Mode (FPII) AND Father Studies (Primary)) 

THEN Failed [The rule covers 11 objects with an error of 0.18]. 

IF (Father Studies (University)) 

THEN Failed [The rule covers 8 objects with an error of 0.25]. 

IF (Family Environment (Rural) AND Father Studies (Bachelor’s degree)) 

THEN Failed [The rule covers 8 objects with an error of 0.12]. 
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We would like to emphasise that only three of the analysed attributes take 
part in the strongest rules. The following conclusions can be derived from these 
rules: 

— The negative factors that affect the students' performance in a Databases
subject are:

• The family environment of the student, if it is rural.

• The access mode of the student, if it is the FP-II access. This access mode
is reserved in Spain for students who do not study the bachelor's degree.

— Father Studies also influences the result; however, this is not conclusive,
probably due to the small number of analysed objects and the high number of
attributes.

Table 1 shows the results obtained by the CAI model with three different values
of $\beta$.

Table 1. Results obtained by the CAI Model

$\beta$ values    Rules   Number of attrs.          Number of objects        Error

$\beta = 0$       1476    1 attr.: 1.08%            1 obj.: 75.0%            Error = 0.0: 100%
                          2 attr.: 27.23%           2 obj.: 16.0%
                          $\geq 3$ attr.: 71.69%    3 obj.: 4.8%
                                                    $\geq 4$ obj.: 4.2%

$\beta = 0.25$    174     1 attr.: 10.5%            1 obj.: 74.15%           Error = 0.0: 88.5%
                          2 attr.: 77.5%            2 obj.: 11.5%            $0 <$ Error $\leq \beta$: 11.5%
                          $\geq 3$ attr.: 12.0%     3 obj.: 2.3%
                                                    $\geq 4$ obj.: 12.05%

$\beta = 0.35$    102     1 attr.: 25.5%            1 obj.: 62.6%            Error = 0.0: 75.5%
                          2 attr.: 68.6%            2 obj.: 9.9%             $0 <$ Error $\leq \beta$: 24.5%
                          3 attr.: 5.9%             3 obj.: 13.8%
                                                    $\geq 4$ obj.: 13.7%

Analysing the results, we can notice the following advantages of the CAI model:

- Reduction of the number of condition attributes in the rules. With $\beta > 0$
the majority of rules have 2 attributes.

- The number of generated rules decreases as the $\beta$ parameter grows. With
$\beta = 0$, 1476 rules are generated; with $\beta > 0$ the number of rules is
reduced by about 90% (to 174 rules for $\beta = 0.25$ and to 102 rules for
$\beta = 0.35$).

- Generation of rules with a higher classification power. The strongest rules
were obtained with $\beta > 0$.

- We can control the degree of uncertainty that is introduced into our
classification.

These advantages agree with the perspectives of the Rough Set model
presented by Pawlak in [10].
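
The size of the reduction can be checked from the rule counts reported in Table 1. The short sketch below only redoes that arithmetic; the dictionary and function names are chosen for the example.

```python
# Rule counts per beta value, as reported in Table 1.
RULE_COUNTS = {0.0: 1476, 0.25: 174, 0.35: 102}


def reduction_percent(baseline, count):
    """Percentage of rules removed relative to the baseline count."""
    return 100 * (1 - count / baseline)


BASELINE = RULE_COUNTS[0.0]
REDUCTIONS = {
    beta: round(reduction_percent(BASELINE, count), 1)
    for beta, count in RULE_COUNTS.items()
    if beta > 0
}
```

The reduction is about 88% for β = 0.25 and about 93% for β = 0.35, so the figure of 90% is exceeded only by the larger β value.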
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6 Conclusions 

In this paper we present a new model based on the VPRS model, referred to as
the CAI model. As a starting point, a revision of knowledge equivalence and the
definition of the concept of approximate equality are introduced. These concepts
have led to a redefinition of the knowledge reduction process. The consequence
of this redefinition is the generation of stronger rules, in the sense that the
induced rules are more significant and have fewer attributes in their
antecedents.

Moreover, we have made a first approach to the study of negative factors that
affect the performance of university students. We want to emphasise that few
studies of this problem have been carried out in Spain. We believe that the CAI
model can be a good starting point for further and deeper studies.

Finally, our research group has developed a software library that allows
us to use the CAI model in practical cases. We are now interested in mechanisms
for the automatic generation of Bayesian networks from data sets based on the
CAI model, using the detection of dependence-independence among attributes to
build the network [11].
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Abstract. Induction of decision rules within the dominance-based rough set 
approach to the multiple-criteria sorting decision problem is discussed in this
paper. We introduce an algorithm called DOMLEM that induces a minimal set 
of generalized decision rules consistent with the dominance principle. An 
extension of this algorithm for a variable consistency model of dominance 
based rough set approach is also presented. 



1. Introduction 

The key aspect of Multiple-Criteria Decision Analysis (MCDA) is consideration of
objects described by multiple criteria representing conflicting points of view. Criteria
are attributes with preference-ordered domains. For example, if decisions about cars
are based on such characteristics as price and fuel consumption, these characteristics
should be treated as criteria because a decision maker usually considers a lower price
better than a higher price and moderate fuel consumption more desirable than higher
consumption. Regular attributes, such as colour and country of production, are
different from criteria because their domains are not preference-ordered.

As pointed out in [1,6], the Classical Rough Set Approach (CRSA) cannot be
applied to multiple-criteria decision problems, as it does not consider criteria but only
regular attributes. Therefore, it cannot discover another kind of inconsistency
concerning violation of the dominance principle, which requires that objects having
better evaluations (or at least the same evaluations) cannot be assigned to a worse
class. For this reason, Greco, Matarazzo and Slowinski [1] have proposed an
extension of rough set theory, called the Dominance-based Rough Set Approach
(DRSA), that is able to deal with this inconsistency typical of exemplary decisions in
MCDA problems. This innovation is mainly based on substitution of the
indiscernibility relation by a dominance relation. In this paper we focus our attention
on one of the major classes of MCDA problems, which is a counterpart of the
multiple-attribute classification problem within MCDA: it is called the multiple-criteria
sorting problem. It concerns an assignment of some objects evaluated by a set of
criteria into some pre-defined and preference-ordered decision classes (categories).

Within DRSA, due to the preference order among decision classes, the sets to be
approximated are the so-called upward and downward unions of decision classes. For

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 304-313, 2001. 
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each decision class, the corresponding upward union is composed of this class and all 
better classes. Analogously, the downward union corresponding to a decision class is
composed of this class and all worse classes. The consequence of considering criteria
instead of regular attributes is the necessity of satisfying the dominance principle,
which requires a change of the approximating items from indiscernibility sets to
dominating and dominated sets. Given an object x, its dominating set is composed of
all objects evaluated not worse than x on all considered criteria, while its dominated
set is composed of all objects evaluated not better than x on all considered criteria.
Moreover, the syntax of DRSA decision rules is different from that of CRSA decision
rules. In the condition part of these rules, the elementary conditions have the form:
"evaluation of object x on criterion q is at least as good as a given level" or
"evaluation of object x on criterion q is at most as good as a given level". In the
decision part of these rules, the conclusion has the form: "object x belongs (or
possibly belongs) to at least a given class" or "object x belongs (or possibly belongs)
to at most a given class".

The aim of this paper is to present an algorithm for inducing DRSA decision rules. 
This algorithm, called DOMLEM, is focused on inducing a minimal set of rules that 
cover all examples in the input data. Moreover, we will show how this algorithm can 
be extended to induce decision rules in a generalization of DRSA, called Variable 
Consistency DRSA model (VC-DRSA). This generalization accepts a limited number 
of counterexamples in rough approximations and in decision rules [2]. 

The paper is organized as follows. In the next section, the main concepts of DRSA
are briefly presented. In section 3, the DOMLEM algorithm is introduced and
illustrated by a didactic example. Extensions of the DOMLEM algorithm for the
VC-DRSA model are discussed in section 4. Conclusions are grouped in the final section.



2. Dominance-based Rough Set Approach 

Basic concepts of DRSA are briefly presented (for more details see e.g. [1]). It is
assumed that exemplary decisions are stored in a data table. By this table we
understand the 4-tuple $S = \langle U, Q, V, f \rangle$, where $U$ is a finite set of
objects, $Q$ is a finite set of attributes, $V = \bigcup_{q \in Q} V_q$ and $V_q$ is
the domain of the attribute $q$, and $f : U \times Q \to V$ is a total function such
that $f(x, q) \in V_q$ for every $q \in Q$, $x \in U$. The set $Q$ is, in general,
divided into a set $C$ of condition attributes and a set $D$ of decision attributes.

Assuming that all condition attributes $q \in C$ are criteria, let $S_q$ be an
outranking relation on $U$ with respect to criterion $q$ such that $x S_q y$ means
"$x$ is at least as good as $y$ with respect to criterion $q$". Furthermore, assuming
that the set of decision attributes $D$ (possibly a singleton $\{d\}$) makes a
partition of $U$ into a finite number of classes, let $Cl = \{Cl_t, t \in T\}$,
$T = \{1, \ldots, n\}$, be a set of these classes such that each $x \in U$ belongs to
one and only one $Cl_t \in Cl$. We suppose that the classes are ordered, i.e. for all
$r, s \in T$ such that $r > s$, the objects from $Cl_r$ are preferred to the objects
from $Cl_s$. The above assumptions are typical for consideration of a
multiple-criteria sorting problem.

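The sets to be approximated are the upward union and downward union of classes,
respectively:

$$Cl_t^{\geq} = \bigcup_{s \geq t} Cl_s, \qquad Cl_t^{\leq} = \bigcup_{s \leq t} Cl_s, \qquad t = 1, \ldots, n.$$

Then, the indiscernibility relation is substituted by a dominance relation. We say
that $x$ dominates $y$ with respect to $P \subseteq C$, denoted by $x D_P y$, if
$x S_q y$ for all $q \in P$. The dominance relation is reflexive and transitive.
Given $P \subseteq C$ and $x \in U$, the "granules of knowledge" used for
approximation in DRSA are:

- a set of objects dominating $x$, called the $P$-dominating set, $D_P^+(x) = \{y \in U : y D_P x\}$,

- a set of objects dominated by $x$, called the $P$-dominated set, $D_P^-(x) = \{y \in U : x D_P y\}$.

Using the $D_P^+(x)$ sets, the $P$-lower and $P$-upper approximations of $Cl_t^{\geq}$
are defined as:

$$\underline{P}(Cl_t^{\geq}) = \{x \in U : D_P^+(x) \subseteq Cl_t^{\geq}\}, \qquad \overline{P}(Cl_t^{\geq}) = \bigcup_{x \in Cl_t^{\geq}} D_P^+(x), \qquad \text{for } t = 1, \ldots, n.$$

Analogously, the $P$-lower and $P$-upper approximations of $Cl_t^{\leq}$ are defined as:

$$\underline{P}(Cl_t^{\leq}) = \{x \in U : D_P^-(x) \subseteq Cl_t^{\leq}\}, \qquad \overline{P}(Cl_t^{\leq}) = \bigcup_{x \in Cl_t^{\leq}} D_P^-(x), \qquad \text{for } t = 1, \ldots, n.$$

The $P$-boundaries ($P$-doubtful regions) of $Cl_t^{\geq}$ and $Cl_t^{\leq}$ are defined as:

$$Bn_P(Cl_t^{\geq}) = \overline{P}(Cl_t^{\geq}) - \underline{P}(Cl_t^{\geq}), \qquad Bn_P(Cl_t^{\leq}) = \overline{P}(Cl_t^{\leq}) - \underline{P}(Cl_t^{\leq}), \qquad \text{for } t = 1, \ldots, n.$$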
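
The dominance-based approximations of an upward union can be sketched as follows. This is a minimal illustration under stated assumptions: all criteria are numeric and of the gain type (a larger value is better), classes are integers ordered by preference, and the function names and data are hypothetical.

```python
def dominates(a, b, criteria):
    """a dominates b if a is at least as good as b on every criterion."""
    return all(a[q] >= b[q] for q in criteria)


def dominating_set(objs, x, criteria):
    """D_P^+(x): the objects that dominate x."""
    return {y for y in objs if dominates(objs[y], objs[x], criteria)}


def upward_union(classes, t):
    """Objects assigned to class t or better."""
    return {x for x, c in classes.items() if c >= t}


def approximate_upward(objs, classes, criteria, t):
    """Return the (lower, upper) approximations of the upward union of class t."""
    union = upward_union(classes, t)
    lower = {x for x in objs if dominating_set(objs, x, criteria) <= union}
    upper = set().union(*(dominating_set(objs, x, criteria) for x in union))
    return lower, upper


# Hypothetical data: c dominates b but is assigned to a worse class.
OBJS = {
    "a": {"q1": 3, "q2": 3},
    "b": {"q1": 2, "q2": 2},
    "c": {"q1": 2, "q2": 3},
    "d": {"q1": 1, "q2": 1},
}
CLASSES = {"a": 2, "b": 2, "c": 1, "d": 1}
LOWER, UPPER = approximate_upward(OBJS, CLASSES, ["q1", "q2"], t=2)
```

Objects b and c violate the dominance principle with respect to each other, so both end up in the boundary of the upward union of class 2. The downward unions can be handled in the same way with dominated sets.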

These approximations of upward and downward unions of classes can serve to
induce generalized "if ..., then ..." decision rules. For a given upward or downward
union $Cl_t^{\geq}$ or $Cl_s^{\leq}$, $s, t \in T$, the rules induced under a hypothesis
that objects belonging to $\underline{P}(Cl_t^{\geq})$ or to $\underline{P}(Cl_s^{\leq})$
are positive and all the others negative suggest an assignment of an object to
"at least class $Cl_t$" or to "at most class $Cl_s$", respectively. They are called
certain $D_\geq$- (or $D_\leq$-) decision rules because they assign objects to
unions of decision classes without any ambiguity. Next, if upper approximations differ
from lower ones, approximate $D_{\geq\leq}$-decision rules can be induced under a
hypothesis that objects belonging to the intersection
$\overline{P}(Cl_s^{\leq}) \cap \overline{P}(Cl_t^{\geq})$ ($s < t$) are positive and all
the others negative. They suggest an assignment of objects to some classes between
$Cl_s$ and $Cl_t$. Yet another option is to induce $D_\geq$- (or $D_\leq$-) possible
decision rules instead of approximate ones under the hypothesis that objects belonging
to $\overline{P}(Cl_t^{\geq})$ or to $\overline{P}(Cl_s^{\leq})$ are positive and all
the others negative. These rules suggest that an object could belong to "at least
class $Cl_t$" or "at most class $Cl_s$", respectively.

Assuming that for each criterion $q \in C$, $V_q \subseteq \mathbb{R}$ (i.e. $V_q$ is
quantitative) and that for each $x, y \in U$, $f(x, q) \geq f(y, q)$ implies $x S_q y$
(i.e. $V_q$ is preference-ordered), the following five types of decision rules can be
considered:

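1) certain $D_\geq$-decision rules with the following syntax:

if $f(x, q_1) \geq r_{q_1}$ and $f(x, q_2) \geq r_{q_2}$ and ... $f(x, q_p) \geq r_{q_p}$, then $x \in Cl_t^{\geq}$,

2) possible $D_\geq$-decision rules with the following syntax:

if $f(x, q_1) \geq r_{q_1}$ and $f(x, q_2) \geq r_{q_2}$ and ... $f(x, q_p) \geq r_{q_p}$, then $x$ could belong to $Cl_t^{\geq}$,

3) certain $D_\leq$-decision rules with the following syntax:

if $f(x, q_1) \leq r_{q_1}$ and $f(x, q_2) \leq r_{q_2}$ and ... $f(x, q_p) \leq r_{q_p}$, then $x \in Cl_t^{\leq}$,

4) possible $D_\leq$-decision rules with the following syntax:

if $f(x, q_1) \leq r_{q_1}$ and $f(x, q_2) \leq r_{q_2}$ and ... $f(x, q_p) \leq r_{q_p}$, then $x$ could belong to $Cl_t^{\leq}$,

where $P = \{q_1, \ldots, q_p\} \subseteq C$, $(r_{q_1}, \ldots, r_{q_p}) \in V_{q_1} \times V_{q_2} \times \ldots \times V_{q_p}$ and $t \in T$;

5) approximate $D_{\geq\leq}$-decision rules with the following syntax:

if $f(x, q_1) \geq r_{q_1}$ and $f(x, q_2) \geq r_{q_2}$ and ... $f(x, q_k) \geq r_{q_k}$ and $f(x, q_{k+1}) \leq r_{q_{k+1}}$ and ... $f(x, q_p) \leq r_{q_p}$,
then $x \in Cl_s \cup Cl_{s+1} \cup \ldots \cup Cl_t$,

where $O' = \{q_1, \ldots, q_k\} \subseteq C$, $O'' = \{q_{k+1}, \ldots, q_p\} \subseteq C$,
$P = O' \cup O''$, $O'$ and $O''$ not necessarily disjoint,
$(r_{q_1}, \ldots, r_{q_p}) \in V_{q_1} \times V_{q_2} \times \ldots \times V_{q_p}$, and
$s, t \in T$ such that $s < t$. As it is possible that $O' \cap O'' \neq \emptyset$,
in the condition part of a $D_{\geq\leq}$-decision rule we can have
"$f(x, q) \geq r_q$" and "$f(x, q) \leq r'_q$", where $r_q \leq r'_q$, for some
$q \in C$. Moreover, if $r_q = r'_q$, the two conditions boil down to "$f(x, q) = r_q$".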

The rules of type 1) and 3) represent certain knowledge extracted from the data 
table, while the rules of type 2), 4) represent possible knowledge, and rules of type 5) 
represent ambiguous knowledge. 

Moreover, each decision rule should be minimal. Since a decision rule is an
implication, by a minimal decision rule we understand such an implication that there is
no other implication with an antecedent of at least the same weakness (in other words,
a rule using a subset of elementary conditions or/and weaker elementary conditions) and
a consequent of at least the same strength (in other words, a rule assigning objects to
the same union or sub-union of classes).

Consider a $D_\geq$-decision rule "if $f(x, q_1) \geq r_{q_1}$ and $f(x, q_2) \geq r_{q_2}$
and ... $f(x, q_p) \geq r_{q_p}$, then $x \in Cl_t^{\geq}$". If there exists an object
$y \in \underline{P}(Cl_t^{\geq})$ such that $f(y, q_1) = r_{q_1}$ and $f(y, q_2) = r_{q_2}$
and ... $f(y, q_p) = r_{q_p}$, then $y$ is called the basis of the rule. Each
$D_\geq$-decision rule having a basis is called robust because it is "founded" on an
object existing in the data table. An analogous definition of robust decision rules
holds for the other types of rules.

We say that an object supports a decision rule if it matches both condition and
decision parts of the rule. On the other hand, an object is covered by a decision rule if
it matches the condition part of the rule.

A set of certain and approximate decision rules is complete if the three following
conditions are fulfilled: each $y \in \underline{C}(Cl_t^{\geq})$ supports at least one
certain $D_\geq$-decision rule whose consequent is $x \in Cl_r^{\geq}$, with
$r, t \in \{2, \ldots, n\}$ and $r \geq t$; each $y \in \underline{C}(Cl_t^{\leq})$
supports at least one certain $D_\leq$-decision rule whose consequent is
$x \in Cl_u^{\leq}$, with $u, t \in \{1, \ldots, n-1\}$ and $u \leq t$; and each
$y \in \overline{C}(Cl_s^{\leq}) \cap \overline{C}(Cl_t^{\geq})$ supports at least one
approximate $D_{\geq\leq}$-decision rule whose consequent is
$x \in Cl_v \cup Cl_{v+1} \cup \ldots \cup Cl_z$, with $s, t, v, z \in T$ and
$s \leq v < z \leq t$.

In simple words, complete means that the set of rules is able to cover all objects 
from the data table in such a way that consistent objects are re-assigned to their 
original classes and inconsistent objects are assigned to clusters of classes referring to 
this inconsistency. An analogous definition of completeness can be formulated for a 
set of possible decision rules. 

We call minimal each set of minimal decision rules that is complete and non- 
redundant, i.e. exclusion of any rule from this set makes it non-complete. 
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3. DOMLEM algorithm 

Various algorithms have been proposed for induction of decision rules within
CRSA (see e.g. [4,7,3] for a review). Many of these algorithms tend to generate a
minimal set of rules with the smallest number of rules. This is an NP-hard problem,
so it is natural to use heuristic algorithms for rule induction, like the LEM2
algorithm proposed by Grzymala-Busse [3]. In this paper, we approach the same problem
with respect to DRSA.

The proposed rule induction algorithm, called DOMLEM, is built on the idea of the
MODLEM algorithm [8]. The latter, inspired by LEM2 [3], was designed to handle
numerical attributes directly during rule induction.

The main procedure of DOMLEM is iteratively repeated for all lower or upper
approximations of the upward (downward) unions of decision classes. Depending on
the type of the approximation we get the corresponding type of decision rules:
e.g. of type 1) from lower approximations of upward unions of classes, and of type 2)
from upper approximations of upward unions of classes.

Moreover, taking into account the preference order of decision classes and the
requirement of minimality of decision rules, the procedure is repeated starting from
the strongest union of classes, e.g. for type 1) decision rules the lower approximations
of upward unions of classes should be considered in the decreasing order of the
classes.

In the algorithm, $P \subseteq C$ and $E$ denotes a complex (conjunction of elementary
conditions $e$) being a candidate for the condition part of a rule. Moreover, $[E]$
denotes the set of objects matching the complex $E$. A complex $E$ is accepted as the
condition part of a rule iff $\emptyset \neq [E] = \bigcap_{e \in E} [e] \subseteq B$,
where $B$ is the considered approximation. For the sake of simplicity, in the following
we present the general scheme of the DOMLEM algorithm only for the case of type 1)
decision rules.

Procedure DOMLEM

(input: L_upp - a family of lower approximations of upward unions of decision classes:
{P(Cl_n^≥), P(Cl_{n-1}^≥), ..., P(Cl_2^≥)}; output: R≥ - a set of D≥-decision rules);

begin
  R≥ := ∅;
  for each B ∈ L_upp do
  begin
    ℰ := find_rules(B);
    for each rule E ∈ ℰ do
      if E is a minimal rule then R≥ := R≥ ∪ {E};
  end
end.

Function find_rules

(input: a set B; output: a set of rules ℰ covering set B);

begin
  G := B; {a set of objects from the given approximation}
  ℰ := ∅;
  while G ≠ ∅ do
  begin
    E := ∅; {starting complex}
    S := G; {set of objects currently covered by E}
    while (E = ∅) or not ([E] ⊆ B) do
    begin
      best := ∅; {best candidate for elementary condition}
      for each criterion q_i ∈ P do
      begin
        Cond := {(f(x,q_i) ≥ r_{q_i}) : ∃x ∈ S (f(x,q_i) = r_{q_i})};
        {for each positive object from S create an elementary condition}
        for each elem ∈ Cond do
          if evaluate({elem} ∪ E) is_better_than evaluate({best} ∪ E) then best := elem;
      end; {for}
      E := E ∪ {best}; {add the best condition to the complex}
      S := S ∩ [best];
    end; {while not ([E] ⊆ B)}
    for each elementary condition e ∈ E do
      if [E − {e}] ⊆ B then E := E − {e};
    create a rule on the basis of E;
    ℰ := ℰ ∪ {E}; {add the induced rule}
    G := B − ∪_{E ∈ ℰ} [E]; {remove examples covered by the rule}
  end; {while G ≠ ∅}
end {function}

Let us comment on the choice of the best condition using the function evaluate(E). A
candidate E for a condition part of a rule could be evaluated by various measures. In
the current version of DOMLEM the complex E with the highest ratio
$|[E] \cap G| / |[E]|$ is chosen. In case of a tie, the complex E with the highest
value of $|[E] \cap G|$ is chosen.

In the case of other types of decision rules, the above scheme works with the
corresponding approximations and elementary conditions. For example, in the case of
type 3) rules, the corresponding approximations are the lower approximations of the
downward unions of classes, considered in the increasing order of preference, and the
elementary conditions are of the form $f(x,q_i) \leq r_{q_i}$. In the case of type 5)
rules, one considers intersections of upper approximations of downward and upward
unions of classes $\overline{P}(Cl_s^{\leq}) \cap \overline{P}(Cl_t^{\geq})$, $s<t$, and
the elementary conditions have the form $f(x,q) \geq r_q$ and $f(x,q') \leq r'_{q'}$ for
$q,q' \in C$; if $q=q'$, then $r_q \leq r'_{q'}$. Furthermore, because of testing
minimality of rules, in the case of type 5) rules, it is useful to discover in a given
intersection $K = \overline{P}(Cl_s^{\leq}) \cap \overline{P}(Cl_t^{\geq})$, $s<t$, two
subsets of objects, called "lower edge" and "upper edge", defined respectively as: the
set of objects from K that do not dominate any other object from K having a different
evaluation on the considered criteria, and the set of objects from K that are not
dominated by any other object from K having a different evaluation on the considered
criteria. Then combinations of conditions based on objects from the "lower edge" with
conditions based on objects from the "upper edge" are the only candidates for entering
a complex.

Notice that the requirement of inducing robust decision rules restricts the search
space, as only conjunctions of elementary conditions with thresholds referring to the
same basis objects are allowed. Let us briefly discuss the computational complexity of
the DOMLEM algorithm. We assume that the basic operation is checking which examples
are covered by a complex (condition). Let m denote the number of attributes and n the
number of objects. In the worst case each rule covers a single object using all
criteria. In this case inducing robust rules requires at most $n(nm+3m-2)/2$
operations. On the other hand, while looking for non-robust rules one cannot restrict
the search to conditions based on a basic object only. Thus (assuming that each
criterion is on average chosen once) we need at most $nm(n+1)(m+1)/4$ operations. So,
the complexity of the algorithm is polynomial.

Illustrative example: 

Consider the following example (see Table 1). A set of 17 objects is described by the
set of 3 criteria $C=\{q_1, q_2, q_3\}$; all are to be maximized according to preference.
The decision attribute d classifies objects into three decision classes $Cl_1$, $Cl_2$,
$Cl_3$, which are preference-ordered according to increasing class number.



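Table 1. Illustrative data table

Object | q1  | q2  | q3  | d
-------+-----+-----+-----+----
1      | 1.5 | 3   | 12  | Cl2
2      | 1.7 | 5   | 9.5 | Cl2
3      | 0.5 | 2   | 2.5 | Cl1
4      | 0.7 | 0.5 | 1.5 | Cl1
5      | 3   | 4.3 | 9   | Cl3
6      | 1   | 2   | 4.5 | Cl2
7      | 1   | 1.2 | 8   | Cl1
8      | 2.3 | 3.3 | 9   | Cl3
9      | 1   | 3   | 5   | Cl1
10     | 1.7 | 2.8 | 3.5 | Cl2
11     | 2.5 | 4   | 11  | Cl2
12     | 0.5 | 3   | 6   | Cl2
13     | 1.2 | 1   | 7   | Cl2
14     | 2   | 2.4 | 6   | Cl1
15     | 1.9 | 4.3 | 14  | Cl2
16     | 2.3 | 4   | 13  | Cl3
17     | 2.7 | 5.5 | 15  | Cl3
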

The downward and upward unions of classes are the following: $Cl_1^{\leq}$ = {3,4,7,9,14},
$Cl_2^{\leq}$ = {1,2,3,4,6,7,9,10,11,12,13,14,15}, $Cl_2^{\geq}$ = {1,2,5,6,8,10,11,12,13,15,16,17},
$Cl_3^{\geq}$ = {5,8,16,17}. There are 5 inconsistent objects violating the dominance
principle, i.e. 6, 8, 9, 11, 14. For instance, object #9 dominates object #6, because it
is at least as good on all criteria $q_1, q_2, q_3$; however, it is assigned to the
decision class $Cl_1$, which is worse than $Cl_2$, to which object #6 belongs. So, the
C-approximations of upward and downward unions of decision classes are:

$\underline{C}(Cl_1^{\leq})$ = {3,4,7}, $\overline{C}(Cl_1^{\leq})$ = {3,4,6,7,9,14}, $Bn_C(Cl_1^{\leq})$ = {6,9,14},
$\underline{C}(Cl_2^{\leq})$ = {1,2,3,4,6,7,9,10,12,13,14,15}, $\overline{C}(Cl_2^{\leq})$ = {1,2,3,4,6,7,8,9,10,11,12,13,14,15},
$\underline{C}(Cl_2^{\geq})$ = {1,2,5,8,10,11,12,13,15,16,17}, $\overline{C}(Cl_2^{\geq})$ = {1,2,5,6,8,9,10,11,12,13,14,15,16,17}, $Bn_C(Cl_2^{\geq})$ = {6,9,14},
$\underline{C}(Cl_3^{\geq})$ = {5,16,17}, $\overline{C}(Cl_3^{\geq})$ = {5,8,11,16,17}, $Bn_C(Cl_3^{\geq})$ = {8,11}.
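The approximations of the upward unions listed above can be recomputed mechanically from Table 1. The following Python sketch is not part of the original paper: the names ROWS, dominating, lower_upward and upper_upward are illustrative assumptions, and the upper approximation is computed as the union of the dominating sets of the objects belonging to the union, which is the usual DRSA formulation.

```python
# Table 1, one row per object 1..17: q1 q2 q3 class-number
ROWS = """1.5 3 12 2
1.7 5 9.5 2
0.5 2 2.5 1
0.7 0.5 1.5 1
3 4.3 9 3
1 2 4.5 2
1 1.2 8 1
2.3 3.3 9 3
1 3 5 1
1.7 2.8 3.5 2
2.5 4 11 2
0.5 3 6 2
1.2 1 7 2
2 2.4 6 1
1.9 4.3 14 2
2.3 4 13 3
2.7 5.5 15 3""".splitlines()
EVAL = {i + 1: tuple(float(v) for v in r.split()[:3]) for i, r in enumerate(ROWS)}
CLASS = {i + 1: int(r.split()[3]) for i, r in enumerate(ROWS)}


def dominating(x):
    """D+(x): objects at least as good as x on every criterion."""
    return {y for y in EVAL if all(a >= b for a, b in zip(EVAL[y], EVAL[x]))}


def upward_union(t):
    """Objects assigned to class t or better."""
    return {x for x in CLASS if CLASS[x] >= t}


def lower_upward(t):
    """Objects whose whole dominating set stays inside the upward union."""
    union = upward_union(t)
    return {x for x in EVAL if dominating(x) <= union}


def upper_upward(t):
    """Union of the dominating sets of all objects of the upward union."""
    return set().union(*(dominating(x) for x in upward_union(t)))


print(sorted(lower_upward(3)), sorted(upper_upward(3)))
```

The difference between the upper and the lower approximation gives the boundary, e.g. `upper_upward(3) - lower_upward(3)`.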

Let us illustrate in detail the induction of certain D≥-decision rules for the upward
union $Cl_3^{\geq}$. The lower approximation $\underline{C}(Cl_3^{\geq})$ is an input set B
to the DOMLEM function find_rules. The elementary conditions for objects {5,16,17} are
as follows (reported elements mean: the condition $e_i$, the set of objects satisfying
the condition $e_i$, the first evaluation measure $|[e_i] \cap G|/|[e_i]|$, the second
evaluation measure $|[e_i] \cap G|$):

$e_1 = (f(x,q_1) \geq 2.3)$, {5,8,11,16,17}, 0.6, 3;
$e_2 = (f(x,q_1) \geq 2.7)$, {5,17}, 1.0, 2;
$e_3 = (f(x,q_2) \geq 4)$, {2,5,11,15,16,17}, 0.5, 3;
$e_4 = (f(x,q_2) \geq 4.3)$, {2,5,15,17}, 0.5, 2;
$e_5 = (f(x,q_2) \geq 5.5)$, {17}, 1.0, 1;
$e_6 = (f(x,q_3) \geq 9)$, {1,2,5,8,11,15,16,17}, 0.38, 3;
$e_7 = (f(x,q_3) \geq 13)$, {15,16,17}, 0.67, 2;
$e_8 = (f(x,q_3) \geq 15)$, {17}, 1.0, 1.

The condition $e_2$ is found to be the best because its first measure is the highest and
it covers more positive examples than $e_5$ and $e_8$. Moreover, as $e_2$ satisfies the
inclusion $[e_2] \subseteq B$, it can be used to create a rule covering two objects, #5
and 17. They are removed from G and the last remaining positive example to be covered
is 16. Now, there are three elementary conditions available:
$e_9 = (f(x,q_1) \geq 2.3)$, {8,11,16}, 0.33, 1; $e_{10} = (f(x,q_2) \geq 4)$, {2,11,15,16},
0.25, 1; $e_{11} = (f(x,q_3) \geq 13)$, {15,16}, 0.5, 1.

The condition $e_{11} = (f(x,q_3) \geq 13)$ is chosen due to the highest first evaluation
measure. On the other hand, it is not sufficient to create a rule using only this
condition because it covers object #15, which is a negative example. So, in the next
iteration one has to consider complexes $E = e_9 \wedge e_{11}$ and $E = e_{10} \wedge e_{11}$.
As the complex $E = e_9 \wedge e_{11}$ has a higher first evaluation measure, $e_9$ is
chosen. Notice that $(f(x,q_3) \geq 13)$ and $(f(x,q_1) \geq 2.3)$ can now be accepted as
the condition part of a rule, as it covers objects #16 and 17.
Proceeding in this way one finally obtains the minimal set of decision rules:

if $(f(x,q_3) \leq 2.5)$, then $x \in Cl_1^{\leq}$ {3, 4}

if $(f(x,q_2) \leq 1.2)$ and $(f(x,q_1) \leq 1.0)$, then $x \in Cl_1^{\leq}$ {4, 7}

if $(f(x,q_1) \leq 2.0)$, then $x \in Cl_2^{\leq}$ {1, 2, 3, 4, 6, 7, 9, 10, 12, 13, 14, 15}

if $(f(x,q_1) \geq 2.7)$, then $x \in Cl_3^{\geq}$ {5, 17}

if $(f(x,q_3) \geq 13.0)$ and $(f(x,q_1) \geq 2.3)$, then $x \in Cl_3^{\geq}$ {16, 17}

if $(f(x,q_1) \geq 1.2)$ and $(f(x,q_3) \geq 7.0)$, then $x \in Cl_2^{\geq}$ {1, 2, 5, 8, 11, 13, 15, 16, 17}

if $(f(x,q_2) \geq 2.8)$ and $(f(x,q_3) \geq 6.0)$, then $x \in Cl_2^{\geq}$ {1, 2, 5, 8, 11, 12, 15, 16, 17}

if $(f(x,q_2) \geq 2.8)$ and $(f(x,q_1) \geq 1.7)$, then $x \in Cl_2^{\geq}$ {2, 5, 8, 10, 11, 15, 16, 17}

if $(f(x,q_1) \geq 2.3)$ and $(f(x,q_2) \leq 3.3)$, then $x \in Cl_2 \cup Cl_3$ {8}

if $(f(x,q_1) \geq 2.5)$ and $(f(x,q_2) \leq 4.0)$, then $x \in Cl_2 \cup Cl_3$ {11}

if $(f(x,q_3) \leq 6.0)$ and $(f(x,q_1) \geq 2.0)$, then $x \in Cl_1 \cup Cl_2$ {14}

if $(f(x,q_3) \geq 4.5)$ and $(f(x,q_3) \leq 5.0)$, then $x \in Cl_1 \cup Cl_2$ {6, 9}
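The induction traced above for the upward union of class 3 can be reproduced with a small sketch of the find_rules function for type 1) rules. This is an illustrative Python reading of the pseudocode, not the authors' implementation: conditions are encoded as hypothetical (criterion index, threshold) pairs meaning "value on that criterion is at least the threshold", coverage is always computed over all 17 objects of Table 1, ties are broken by iteration order, and conditions already in the complex are skipped.

```python
# Table 1, one row per object 1..17: q1 q2 q3 (class column omitted here)
ROWS = """1.5 3 12
1.7 5 9.5
0.5 2 2.5
0.7 0.5 1.5
3 4.3 9
1 2 4.5
1 1.2 8
2.3 3.3 9
1 3 5
1.7 2.8 3.5
2.5 4 11
0.5 3 6
1.2 1 7
2 2.4 6
1.9 4.3 14
2.3 4 13
2.7 5.5 15""".splitlines()
DATA = {i + 1: tuple(float(v) for v in r.split()) for i, r in enumerate(ROWS)}


def covered(conds):
    """[E]: objects satisfying every condition (q, r), i.e. value on q >= r."""
    return {x for x, v in DATA.items() if all(v[q] >= r for q, r in conds)}


def evaluate(conds, G):
    """Return (|[E] & G| / |[E]|, |[E] & G|); the second entry breaks ties."""
    cov = covered(conds)
    pos = len(cov & G)
    return (pos / len(cov), pos) if cov else (0.0, 0)


def find_rules(B):
    """Greedy covering of B by conjunctions of '>=' conditions (type 1 rules)."""
    rules, G = [], set(B)
    while G:
        E, S = [], set(G)
        while not E or not covered(E) <= B:
            best, best_score = None, (-1.0, -1)
            for q in range(3):
                # thresholds come from the positive objects still in S
                for r in sorted({DATA[x][q] for x in S}):
                    if (q, r) in E:
                        continue
                    score = evaluate(E + [(q, r)], G)
                    if score > best_score:
                        best, best_score = (q, r), score
            if best is None:
                raise ValueError("B cannot be described by the available conditions")
            E.append(best)
            S &= covered([best])
        for e in list(E):  # drop redundant conditions
            rest = [c for c in E if c != e]
            if rest and covered(rest) <= B:
                E = rest
        rules.append(E)
        G = set(B) - set().union(*(covered(r) for r in rules))
    return rules


rules = find_rules({5, 16, 17})
print(rules)
```

On the lower approximation {5, 16, 17} the sketch yields the two rules reported in the text: one with the single condition on the first criterion at threshold 2.7, and one combining the third criterion at 13 with the first criterion at 2.3. Note that the intermediate coverage figures differ slightly from those quoted in the example, because this sketch does not exclude already covered objects when evaluating a condition.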
4. Decision rules in Variable Consistency model of Dominance-based Rough Set Approach



In [2] we proposed a generalization of DRSA to the variable consistency model
(VC-DRSA). It allows one to define lower approximations of the unions of decision
classes accepting a limited number of negative examples, controlled by a pre-defined
level of consistency $l \in (0, 1]$. Within VC-DRSA, given $P \subseteq C$ and consistency
level $l$, the P-lower and P-upper approximations of the upward unions of classes are
the following:

$$\underline{P}^{l}(Cl_t^{\geq}) = \left\{ x \in Cl_t^{\geq} : \frac{card(D_P^{+}(x) \cap Cl_t^{\geq})}{card(D_P^{+}(x))} \geq l \right\},$$

$$\overline{P}^{l}(Cl_t^{\geq}) = Cl_t^{\geq} \cup \left\{ x \in Cl_{t-1}^{\leq} : \frac{card(D_P^{-}(x) \cap Cl_{t-1}^{\leq})}{card(D_P^{-}(x))} < l \right\}.$$


The definitions of approximations for downward unions are analogous; see [2].
These approximations are used for induction of decision rules having the same syntax
as in DRSA. In the VC-DRSA context each decision rule is characterized by an
additional parameter $\alpha$ called the confidence of the rule. It is the ratio of the
number of objects supporting the rule to the number of objects covered by the rule.

The induction of such rules can be done after simple modifications of the
DOMLEM algorithm. First, the inputs B of the algorithm are the new
$P^{l}$-approximations of upward or downward unions of decision classes. Notice that in
DRSA the complex E was accepted as a condition part of a rule iff $[E] \subseteq B$. This
corresponds to the requirement that $|[E] \cap B|/|[E]|$ should be equal to 1. The key
point of VC-DRSA is a relaxation of this requirement, permitting a rule to be built on
a complex E having a confidence $\alpha$ not worse than the consistency level $l$. The
rest of the algorithm remains unchanged.



Continuation of the example. Let us assume that the user considers only criteria
$P=\{q_1, q_2\}$ and is interested in analysing the upward union $Cl_2^{\geq}$. The DRSA
leads to $\underline{P}(Cl_2^{\geq})$ = {1,2,5,8,10,11,15,16,17} and boundary
$Bn_P(Cl_2^{\geq})$ = {6,9,12,13,14}. The two following decision rules are induced to
describe objects from $\underline{P}(Cl_2^{\geq})$:

if $(f(x,q_1) \geq 1.7)$ and $(f(x,q_2) \geq 2.8)$, then $x \in Cl_2^{\geq}$, {2,5,8,10,11,15,16,17}

if $(f(x,q_1) \geq 1.5)$ and $(f(x,q_2) \geq 3)$, then $x \in Cl_2^{\geq}$, {1,2,5,8,11,15,16,17}.

Let us assume now that the user works with VC-DRSA accepting consistency level
$l$ equal to 0.75. As P-dominating sets of objects #6, 12 and 13 are contained in
$Cl_2^{\geq}$ with a degree greater than the consistency level (0.83, 0.9 and 0.91,
respectively) they can be added to the lower approximation
$\underline{P}^{0.75}(Cl_2^{\geq})$. The boundary region is now composed of only two
objects, #9 and 14. Further on, the following rules are induced from the lower
approximation $\underline{P}^{0.75}(Cl_2^{\geq})$ (within parentheses there are objects
supporting the corresponding rules and objects only covered by the corresponding
rules but not satisfying their decision parts; the latter are marked by *):

if $(f(x,q_1) \geq 1.2)$, then $x \in Cl_2^{\geq}$ with confidence 0.91, {1,2,5,8,10,11,13,14*,15,16,17}

if $(f(x,q_2) \geq 2.0)$, then $x \in Cl_2^{\geq}$ with confidence 0.79, {1,2,3*,5,6,8,9*,10,11,12,14*,15,16,17}

One can also notice that these rules are supported by more examples (i.e. 10 and 
11, respectively) than the previous ones (8 in both). 
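The effect of the consistency level on the lower approximation in this example can be checked with a short computation. The Python sketch below is illustrative only and not taken from the paper; vc_lower_upward and p_dominating are hypothetical names, criteria are addressed by index (0 and 1 for the first two criteria), and the test for membership is the ratio of the dominating set falling inside the upward union, compared with the consistency level.

```python
# Table 1, one row per object 1..17: q1 q2 q3 class-number
ROWS = """1.5 3 12 2
1.7 5 9.5 2
0.5 2 2.5 1
0.7 0.5 1.5 1
3 4.3 9 3
1 2 4.5 2
1 1.2 8 1
2.3 3.3 9 3
1 3 5 1
1.7 2.8 3.5 2
2.5 4 11 2
0.5 3 6 2
1.2 1 7 2
2 2.4 6 1
1.9 4.3 14 2
2.3 4 13 3
2.7 5.5 15 3""".splitlines()
EVAL = {i + 1: tuple(float(v) for v in r.split()[:3]) for i, r in enumerate(ROWS)}
CLASS = {i + 1: int(r.split()[3]) for i, r in enumerate(ROWS)}


def p_dominating(x, P):
    """Objects at least as good as x on every criterion index in P."""
    return {y for y in EVAL if all(EVAL[y][q] >= EVAL[x][q] for q in P)}


def vc_lower_upward(t, P, level):
    """Objects of the upward union whose dominating set lies inside the
    union to a degree of at least `level` (level = 1.0 gives plain DRSA)."""
    union = {x for x in CLASS if CLASS[x] >= t}
    result = set()
    for x in union:
        dom = p_dominating(x, P)
        if len(dom & union) / len(dom) >= level:
            result.add(x)
    return result


P = (0, 1)  # only the first two criteria, as in the example
print(sorted(vc_lower_upward(2, P, 1.0)))
print(sorted(vc_lower_upward(2, P, 0.75)))
```

With level 1.0 the sketch returns the nine objects of the DRSA lower approximation; with level 0.75 objects 6, 12 and 13 are admitted as well.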



6. Conclusions 

The paper addressed the important issue of inducing decision rules for multicriteria
sorting problems. As none of the already known rule induction algorithms can be
directly applied to multicriteria sorting problems, we introduced a specific algorithm
called DOMLEM. It produces a complete and non-redundant, i.e. minimal, set of decision
rules. It heuristically tends to minimize the number of generated rules. It was also
extended to produce decision rules accepting a limited number of negative examples
within the variable consistency model of the dominance-based rough set approach.

Abstract. Monitoring the mechanical impact of a loose (detached or
drifting) part in the reactor coolant system of a nuclear power plant
is one of the essential functions for operation and maintenance of the
plant. Large data tables are generated during this monitoring process.
This data can be "mined" to reveal latent patterns of interest to operation
and maintenance. Rough set theory has been applied successfully
to data mining. It can be used in the nuclear power industry and elsewhere
to identify classes in datasets, find dependencies in relations
and discover rules which are hidden in databases. This paper can be
considered as one of a series, the earlier ones being summarized in Guan
& Bell (2000a). These methods can be used to understand and control
aspects of the causes and effects of loose parts in nuclear power plants.
So in this paper we illustrate the use of our data mining methods by
means of a running example using Envelope Rising Time data ERT on
monitoring loose parts in nuclear power plants.



Introduction 

A significant percentage of the world’s electricity is now produced by nuclear 
power plants. Monitoring the mechanical impact of a loose (detached or drifting) 
part in the reactor coolant system of a nuclear power plant is one of the essential 
functions for operation and maintenance of the plant. One way of contributing 
to the solutions of problems in this area is to gain clear insights into causes and 
effects of loose parts using data mining. This is the computer-based technique of 
discovering interesting, useful, and previously unknown patterns from massive 
databases — such as those generated in nuclear energy operation. Our approach 
to data mining is to use rough set theory (Pawlak, 1991; Pawlak, Grzymala- 
Busse, Slowinski, & Ziarko 1995). 

In many applications rough set theory has been applied successfully to rough
classification and knowledge discovery. We have previously presented results of
such application to nuclear power generation operation and control (Guan & Bell
1998, 2000b). Methods for using rough sets to identify classes in datasets, find
dependencies in relations and discover rules which are hidden in databases
have been developed. We use these methods again but here we apply them to the
loose part monitoring problem. The data to be mined here is the Envelope Rising
Time data ERT, expressing the features of loose parts. This data is analyzed as a
running example throughout this paper, and is used to illustrate our algorithms
as they are presented.

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 314-321, 2001.

The shape of the paper follows the usual pattern for this series of papers
on nuclear power generation, and the algorithms are those used for other
safety applications in previous papers. In section 1, we discuss decision tables.
A decision table consists of two finite sets: a universe U and an attribute set
$A = C \cup D$, where $C \cap D = \emptyset$, and attributes in C are called condition
attributes and attributes in D are called decision attributes. We have previously
proposed algorithms for discovery of knowledge based on rough analysis (Bell & Guan
1998). Section 2 illustrates a simple classification for an attribute, and we introduce
rough subsets and support subsets. In section 3, we discuss the support degree
to a decision attribute from many condition attributes. Section 4 deals with the
significance of a condition attribute a in C. We deal with attribute significance
on one subset of condition attributes. These methods are also applied to the
ERT data. Finally, in section 5, an algorithm to discover knowledge is used for
the ERT data.

1 Information Systems and Decision Tables 

An information system $\mathcal{I}$ is a system $\langle U, A \rangle$, where

1. $U = \{u_1, u_2, \ldots, u_i, \ldots, u_{|U|}\}$ is a finite non-empty set, called the
universe or object space; elements of U are called objects;

2. $A = \{a_1, a_2, \ldots, a_j, \ldots, a_{|A|}\}$ is also a finite non-empty set; elements
of A are called attributes;

3. for every $a \in A$ there is a mapping a from U into some space, $a : U \to a(U)$,
and $a(U) = \{a(u) \mid u \in U\}$ is called the domain of attribute a.

We want to find dependencies in relations and to discover rules which are
hidden in databases. We can consider some attributes to be condition attributes
and some others to be decision ones. Then we can discover the relation between
condition and decision, and predict decision from condition. Thus, an information
system $\langle U, A \rangle$ is called a decision table if we have $A = C \cup D$ and
$C \cap D = \emptyset$, where attributes in C are called condition attributes and
attributes in D are called decision attributes.

Example 1.1. The Oh ERT data: Envelope Rising Time on monitoring loose parts

rule | TR1    | TR2    | TR3    | Confid.
-----+--------+--------+--------+--------
u1   | Short  | Short  | Middle | High
u2   | Short  | Middle | Middle | High
u3   | Short  | Middle | Long   | High
u4   | Short  | Short  | Short  | Middle
u5   | Short  | Short  | Long   | Middle
u6   | Short  | Middle | Short  | Middle
u7   | Middle | Short  | Middle | Middle
u8   | Middle | Middle | Middle | Middle
u9   | Middle | Middle | Short  | Low
u10  | Long   | Any    | Any    | Low
u11  | Any    | Long   | Any    | Low

In the table above TR1, TR2, TR3 are the Envelope Rising Times of the first,
second and third arrived signal, respectively.

Oh (Oh 1996) made the fuzzy rule base for the evaluation of the confidence level
of input signals based on the relationship between the wave arrival sequences
and envelope pattern changes.

Note that this table is an information system, where
$A = \{TR1, TR2, TR3, Confidence\}$. In fact, it is also a database relation.

To consider it as a decision table, let $C = \{TR1, TR2, TR3\}$ and $D = \{y\}$,
y = Confidence. Then the information system becomes a decision table.

2 Rough Subsets and Support Subsets 

Let $\langle U, A \rangle$ be an information system. For every attribute $a \in A$ we can
introduce a classification $U/a$ in universe U as follows: two objects $u, v \in U$ are
in the same class if and only if $a(u) = a(v)$.

Let W be a subset of U. For a classification $U/a$, the rough subset of subset
W from $U/a$ is a pair of subsets $(W^{U/a+}, W^{U/a-})$, where

1. $W^{U/a+} = \bigcup_{V \in U/a,\ V \cap W \neq \emptyset} V$ is called the upper approximation
to W from $U/a$. It is not used in this paper;

2. $W^{U/a-} = \bigcup_{V \in U/a,\ V \subseteq W} V$, also denoted by $S_a(W)$, is called the
lower approximation to W from $U/a$. Subset $S_a(W)$ is also said to be the support
subset to W from attribute a, and $spt_a(W) = |S_a(W)|/|U|$ is said to be the
support degree to W from attribute a.

Briefly, $S_x(W)$ means that the rule "$x = x(S_x(W))$ implies $y = y(S_x(W))$" has
strength $spt_x(W) = |S_x(W)|/|U|$ (Grzymala-Busse 1991, Guan & Bell 1991).

When $spt_a(W) = 0$ there is an inconsistency.

Let $y \in D$ be a decision attribute in a decision table $\langle U, A \rangle$, where
$A = C \cup D$, $C \cap D = \emptyset$. We now consider an overall decision for a decision
attribute $y \in D$ rather than a local decision for a "decision subset" $W \in U/y$.

The support subset to decision attribute $y \in D$ from condition attribute $a \in C$
is the subset $S_a(y) = \bigcup_{W \in U/y} W^{U/a-} = \bigcup_{W \in U/y} \left( \bigcup_{V \in U/a,\ V \subseteq W} V \right)$,
and $spt_a(y) = \left| \bigcup_{W \in U/y} W^{U/a-} \right| / |U|$ is called the support degree
to y from a.

If $U/y = U/\delta$, where $\delta$ is the "universal" partition $U/\delta = \{U\}$, then we
have $S_a(y) = U$, $spt_a(y) = 1$ for all $a \in C$.

Example 2.1. For the ERT data, we have classifications as follows.

$U/y = \{W_1, W_2, W_3\}$, where $W_1 = \{u_1, u_2, u_3\}$,
$W_2 = \{u_4, u_5, u_6, u_7, u_8\}$, $W_3 = \{u_9, u_{10}, u_{11}\}$;

$U/TR1 = \{\{u_1, u_2, u_3, u_4, u_5, u_6\}, \{u_7, u_8, u_9\}, \{u_{10}\}, \{u_{11}\}\}$;

$U/TR2 = \{\{u_1, u_4, u_5, u_7\}, \{u_2, u_3, u_6, u_8, u_9\}, \{u_{10}\}, \{u_{11}\}\}$;

$U/TR3 = \{\{u_1, u_2, u_7, u_8\}, \{u_3, u_5\}, \{u_4, u_6, u_9\}, \{u_{10}, u_{11}\}\}$.

Also, we have

$S_{TR1}(W_1) = S_{TR1}(W_2) = \{\}$, $S_{TR1}(W_3) = \{u_{10}, u_{11}\}$;

$S_{TR2}(W_1) = S_{TR2}(W_2) = \{\}$, $S_{TR2}(W_3) = \{u_{10}, u_{11}\}$;

$S_{TR3}(W_1) = S_{TR3}(W_2) = \{\}$, $S_{TR3}(W_3) = \{u_{10}, u_{11}\}$.

For example, $S_{TR2}(W_3) = \{u_{10}, u_{11}\}$ means the tuples $u = u_{10}, u_{11}$ in
$W_3 = \{u_9, u_{10}, u_{11}\}$ support a rule which states that condition
$TR2(u)$ = Any, Long implies decision $y(u)$ = Low with strength $spt_{TR2}(W_3) = 2/11$.
Also, $S_{TR2}(W_1) = \{\}$ means there is an inconsistency.

For example, let us consider a class $\{u_1, u_4, u_5, u_7\}$ in $U/TR2$.

On one hand, a tuple $u = u_1$ in the class and in $W_1 = \{u_1, u_2, u_3\}$ supports a
rule which states that condition $TR2(u)$ = Short implies decision $y(u)$ = High.

On the other hand, another tuple $u = u_4$ in the class but not in
$W_1 = \{u_1, u_2, u_3\}$ supports a rule which states that condition $TR2(u)$ = Short
implies decision $y(u)$ = Middle.

These two rules are inconsistent.

Finally, we have $S_{TR1}(y) = S_{TR2}(y) = S_{TR3}(y) = \{u_{10}, u_{11}\}$.
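The support subsets of Example 2.1 can be recomputed directly from the ERT table. The following Python sketch is only an illustration and does not come from the paper; the function names partition, support_subset and support_degree, the one-letter value codes and the use of object indices 1 to 11 in place of the tuples of the table are assumptions made for the example.

```python
from fractions import Fraction

# ERT table of Example 1.1, rows u1..u11: TR1 TR2 TR3 Confidence
# (S = Short, M = Middle, L = Long, A = Any)
ROWS = """S S M High
S M M High
S M L High
S S S Middle
S S L Middle
S M S Middle
M S M Middle
M M M Middle
M M S Low
L A A Low
A L A Low""".splitlines()
ERT = {i + 1: tuple(r.split()) for i, r in enumerate(ROWS)}
COL = {"TR1": 0, "TR2": 1, "TR3": 2, "y": 3}


def partition(attr):
    """U/a: group object indices by their value on attribute `attr`."""
    classes = {}
    for u, row in ERT.items():
        classes.setdefault(row[COL[attr]], set()).add(u)
    return list(classes.values())


def support_subset(attr, W):
    """Union of the classes of U/a that lie entirely inside W."""
    return {u for V in partition(attr) if V <= W for u in V}


def support_degree(attr, W):
    """Size of the support subset relative to the size of the universe."""
    return Fraction(len(support_subset(attr, W)), len(ERT))


W1, W2, W3 = {1, 2, 3}, {4, 5, 6, 7, 8}, {9, 10, 11}
for attr in ("TR1", "TR2", "TR3"):
    print(attr, sorted(support_subset(attr, W3)), support_degree(attr, W3))
```

Using exact fractions avoids rounding when comparing strengths such as 2/11.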

3 The Support Degree from Many Conditions 

Let $\langle U, A \rangle$ be an information system. For two attributes $a, b \in A$ we
need to compute the following classification $U/ab$ in universe U: two objects
$u, v \in U$ are in the same class if and only if $a(u) = a(v)$ and $b(u) = b(v)$.

For a set $X \subseteq A$ of attributes we define classification $U/X$ in universe U as
follows: two objects $u, v \in U$ are in the same class if and only if $a(u) = a(v)$ for
every $a \in X$.

We also define $U/\emptyset = U/\delta$, where $\delta$ is the "universal classification"
$U/\delta = \{U\}$, $|U/\delta| = 1$.

Let $W \subseteq U$ be a subset of universe U. We now consider support to the subset
W from a set of condition attributes.

For a condition attribute set $X \subseteq C$ in the decision table
$\langle U, C \cup D \rangle$, the support subset to W from condition attributes X is the
subset $S_X(W) = W^{U/X-} = \bigcup_{V \in U/X,\ V \subseteq W} V$, and
$spt_X(W) = |S_X(W)|/|U|$ is called the support degree to W from X.

Briefly, $S_X(W)$ means that the rule "$X = X(S_X(W))$ implies $Y = Y(S_X(W))$"
has strength $spt_X(W) = |S_X(W)|/|U|$.

When $spt_X(W) = 0$ there is an inconsistency.

Let $Y \subseteq D$ be a subset of some decision attributes in
$\langle U, C \cup D \rangle$. We now consider support to the subset Y from a condition
attribute set $X \subseteq C$: the support subset to attributes Y from condition
attributes X is the subset

$S_X(Y) = \bigcup_{W \in U/Y} W^{U/X-} = \bigcup_{W \in U/Y} \left( \bigcup_{V \in U/X,\ V \subseteq W} V \right)$,

and $spt_X(Y) = \left| \bigcup_{W \in U/Y} W^{U/X-} \right| / |U|$ is called the support
degree to Y from X.

318 J. W. Guan and D.A. Bell

Example 3.1. For the ERT data, we find that $U/TR1 \wedge TR2$ is
$\{\{u_1, u_4, u_5\}, \{u_2, u_3, u_6\}, \{u_7\}, \{u_8, u_9\}, \{u_{10}\}, \{u_{11}\}\}$;
$U/TR1 \wedge TR3$ is
$\{\{u_1, u_2\}, \{u_3, u_5\}, \{u_4, u_6\}, \{u_7, u_8\}, \{u_9\}, \{u_{10}\}, \{u_{11}\}\}$;
$U/TR2 \wedge TR3$ is
$\{\{u_1, u_7\}, \{u_2, u_8\}, \{u_3\}, \{u_4\}, \{u_5\}, \{u_6, u_9\}, \{u_{10}\}, \{u_{11}\}\}$;
$U/TR1 \wedge TR2 \wedge TR3$ is
$\{\{u_1\}, \{u_2\}, \{u_3\}, \{u_4\}, \{u_5\}, \{u_6\}, \{u_7\}, \{u_8\}, \{u_9\}, \{u_{10}\}, \{u_{11}\}\}$.

Also, we have

$S_{TR1 \wedge TR2}(W_1) = \{\}$, $S_{TR1 \wedge TR2}(W_2) = \{u_7\}$, $S_{TR1 \wedge TR2}(W_3) = \{u_{10}, u_{11}\}$.

For example, $S_{TR1 \wedge TR2}(W_2) = \{u_7\}$ means the tuple $u = u_7$ in
$W_2 = \{u_4, u_5, u_6, u_7, u_8\}$ supports a rule which states that condition
$TR1(u)$ = Middle and $TR2(u)$ = Short implies decision $y(u)$ = Middle.

Also, $S_{TR1 \wedge TR2}(W_1) = \{\}$ means there is an inconsistency.

For example, let us consider a class $\{u_1, u_4, u_5\}$ in $U/TR1 \wedge TR2$.

On one hand, a tuple $u = u_1$ in the class and in $W_1 = \{u_1, u_2, u_3\}$ supports a
rule which states that conditions $TR1(u)$ = Short and $TR2(u)$ = Short imply
decision $y(u)$ = High.

On the other hand, another tuple $u = u_4$ in the class but not in
$W_1 = \{u_1, u_2, u_3\}$ supports a rule which states that conditions $TR1(u)$ = Short
and $TR2(u)$ = Short imply decision $y(u)$ = Middle.

These two rules are inconsistent.

Finally, we have
$S_{TR1 \wedge TR3}(W_1) = \{u_1, u_2\}$, $S_{TR1 \wedge TR3}(W_2) = \{u_4, u_6, u_7, u_8\}$,
$S_{TR1 \wedge TR3}(W_3) = \{u_9, u_{10}, u_{11}\}$;
$S_{TR2 \wedge TR3}(W_1) = \{u_3\}$, $S_{TR2 \wedge TR3}(W_2) = \{u_4, u_5\}$,
$S_{TR2 \wedge TR3}(W_3) = \{u_{10}, u_{11}\}$;
$S_{TR1 \wedge TR2 \wedge TR3}(W_1) = W_1 = \{u_1, u_2, u_3\}$,
$S_{TR1 \wedge TR2 \wedge TR3}(W_2) = W_2 = \{u_4, u_5, u_6, u_7, u_8\}$,
$S_{TR1 \wedge TR2 \wedge TR3}(W_3) = W_3 = \{u_9, u_{10}, u_{11}\}$.
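The classifications and support subsets for several condition attributes can be obtained by grouping objects on tuples of attribute values. The sketch below is an illustration in Python, not code from the paper; partition, support_subset and support_to_decision are hypothetical names, and objects are again represented by their indices 1 to 11.

```python
# ERT table of Example 1.1, rows u1..u11: TR1 TR2 TR3 Confidence
# (S = Short, M = Middle, L = Long, A = Any)
ROWS = """S S M High
S M M High
S M L High
S S S Middle
S S L Middle
S M S Middle
M S M Middle
M M M Middle
M M S Low
L A A Low
A L A Low""".splitlines()
ERT = {i + 1: tuple(r.split()) for i, r in enumerate(ROWS)}
COL = {"TR1": 0, "TR2": 1, "TR3": 2, "y": 3}


def partition(attrs):
    """U/X: objects are equivalent when they agree on every attribute in attrs."""
    classes = {}
    for u, row in ERT.items():
        key = tuple(row[COL[a]] for a in attrs)
        classes.setdefault(key, set()).add(u)
    return list(classes.values())


def support_subset(attrs, W):
    """Union of the classes of U/X contained in W."""
    return {u for V in partition(attrs) if V <= W for u in V}


def support_to_decision(attrs, decision=("y",)):
    """Union of the support subsets over all decision classes W of U/Y."""
    return set().union(*(support_subset(attrs, W) for W in partition(decision)))


W2 = {4, 5, 6, 7, 8}
print(sorted(support_subset(("TR1", "TR2"), W2)))
print(len(support_to_decision(("TR1", "TR3"))), "of", len(ERT), "objects supported")
```

With an empty attribute tuple the partition collapses to the single class U, which matches the convention for the universal classification given above.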



4 Significance, Significant Subsets of Attributes 

Definition 4.1. Let X be a non-empty subset of C: $\emptyset \subset X \subseteq C$. Let Y be
a subset of D: $Y \subseteq D$ such that $Y \neq \emptyset$, $U/Y \neq U/\delta = \{U\}$. Given
an attribute $x \in X$, we say that x is significant (for Y) in X if
$S_X(Y) \supset S_{X-\{x\}}(Y)$; and that x is not significant or nonsignificant (for Y)
in X if $S_X(Y) = S_{X-\{x\}}(Y)$.

Definition 4.2. Let X be a non-empty subset of C: $\emptyset \subset X \subseteq C$. Let Y
be a subset of D: $Y \subseteq D$ such that $Y \neq \emptyset$, $U/Y \neq U/\delta = \{U\}$.
Given an attribute $x \in X$, we define the significance of x (for Y) in X as

$$sig^{Y}_{X-\{x\}}(x) = \frac{|S_X(Y)| - |S_{X-\{x\}}(Y)|}{|U|}.$$

In the special case where Y is a singleton, $Y = \{y\}$, we also denote
$sig^{\{y\}}_{X-\{x\}}(x)$ by $sig^{y}_{X-\{x\}}(x)$.

In the special case where X is a singleton, $X = \{x\}$, we also denote
$sig^{Y}_{\emptyset}(x)$ by $sig^{Y}(x)$:
$sig^{Y}(x) = sig^{Y}_{\emptyset}(x) = \frac{|S_x(Y)| - |S_{\emptyset}(Y)|}{|U|} = \frac{|S_x(Y)|}{|U|} = spt_x(Y)$.

So we always have $sig^{Y}(x) > 0$ unless $S_x(Y) = \emptyset$.

Also, in the special case where X contains two attributes, $X = \{x_1, x_2\}$, we
denote $sig^{Y}_{X-\{x_1\}}(x_1)$ by $sig^{Y}_{x_2}(x_1)$:
$sig^{Y}_{x_2}(x_1) = \frac{|S_{x_1 x_2}(Y)| - |S_{x_2}(Y)|}{|U|}$.

Definition 4.3. Let $\mathcal{D} = \langle U, C \cup D \rangle$ be a decision table, where C is
the condition attribute set and D is the decision attribute set. Let Y be a subset of
D: $Y \subseteq D$ such that $Y \neq \emptyset$, $U/Y \neq U/\delta = \{U\}$. Let X be a
non-empty subset of C: $\emptyset \subset X \subseteq C$. The non-empty subset X is said to be
significant or independent (for Y) if each $x \in X$ is significant (for Y) in X;
otherwise X is nonsignificant (for Y). An empty set $\emptyset$ is said to be significant
(for Y).

To check whether or not X is significant, we can compute the $|X|$ significances
$sig^{Y}_{X-\{x_1\}}(x_1), \ldots, sig^{Y}_{X-\{x_{|X|}\}}(x_{|X|})$ and check whether or not
they are greater than 0.

Example 4.1. For the ERT data, we have the following.

(1) From example 2.1 we find that

$sig^{y}(TR1) = spt_{TR1}(y) = 2/11$; $sig^{y}(TR2) = spt_{TR2}(y) = 2/11$;
$sig^{y}(TR3) = spt_{TR3}(y) = 2/11$.

(2) From example 3.1, we find that

$sig^{y}_{TR1}(TR2) = 1/11$; $sig^{y}_{TR2}(TR1) = 1/11$; $sig^{y}_{TR1}(TR3) = 7/11$;
$sig^{y}_{TR3}(TR1) = 7/11$; $sig^{y}_{TR3}(TR2) = 3/11$; $sig^{y}_{TR2}(TR3) = 3/11$;
$sig^{y}_{TR1 \wedge TR3}(TR2) = 2/11$; $sig^{y}_{TR1 \wedge TR2}(TR3) = 8/11$;
$sig^{y}_{TR2 \wedge TR3}(TR1) = 6/11$.

Also, we know the significance (for y) of the following subsets of
$X = \{TR1, TR2, TR3\}$.

(1) $\emptyset$ is significant.

(2) Singletons $\{TR1\}$, $\{TR2\}$, $\{TR3\}$ are significant since
$sig^{y}(TR1), sig^{y}(TR2), sig^{y}(TR3) > 0$.

(3) $\{TR1, TR2\}$ is significant since $sig^{y}_{TR2}(TR1) = 1/11 > 0$ and so TR1 is
significant in $\{TR1, TR2\}$, and $sig^{y}_{TR1}(TR2) = 1/11 > 0$ and so TR2 is
significant.

$\{TR1, TR3\}$ is significant since $sig^{y}_{TR1}(TR3) = 7/11 > 0$ and
$sig^{y}_{TR3}(TR1) = 7/11 > 0$.

$\{TR3, TR2\}$ is significant since $sig^{y}_{TR3}(TR2) = 3/11 > 0$ and
$sig^{y}_{TR2}(TR3) = 3/11 > 0$.

$\{TR1, TR2, TR3\}$ is significant since $sig^{y}_{TR2 \wedge TR3}(TR1) = 6/11 > 0$ and so
TR1 is significant in $\{TR1, TR2, TR3\}$, $sig^{y}_{TR1 \wedge TR3}(TR2) = 2/11 > 0$ and so
TR2 is significant, $sig^{y}_{TR1 \wedge TR2}(TR3) = 8/11 > 0$ and so TR3 is significant.
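The significances quoted in Example 4.1 follow from Definition 4.2 by counting supported objects with and without the attribute in question. The Python sketch below illustrates this under stated assumptions: it is not the authors' code, the names support and significance are hypothetical, and the decision attribute set is fixed to the single attribute y.

```python
from fractions import Fraction

# ERT table of Example 1.1, rows u1..u11: TR1 TR2 TR3 Confidence
# (S = Short, M = Middle, L = Long, A = Any)
ROWS = """S S M High
S M M High
S M L High
S S S Middle
S S L Middle
S M S Middle
M S M Middle
M M M Middle
M M S Low
L A A Low
A L A Low""".splitlines()
ERT = {i + 1: tuple(r.split()) for i, r in enumerate(ROWS)}
COL = {"TR1": 0, "TR2": 1, "TR3": 2, "y": 3}


def partition(attrs):
    """U/X for a sequence of attribute names (a single class when empty)."""
    classes = {}
    for u, row in ERT.items():
        classes.setdefault(tuple(row[COL[a]] for a in attrs), set()).add(u)
    return list(classes.values())


def support(attrs, decision=("y",)):
    """Objects lying in a class of U/X that fits inside one decision class."""
    return {u for W in partition(decision)
            for V in partition(attrs) if V <= W for u in V}


def significance(X, x, decision=("y",)):
    """Gain in supported objects when x is added to X - {x}, relative to |U|."""
    rest = [a for a in X if a != x]
    gain = len(support(X, decision)) - len(support(rest, decision))
    return Fraction(gain, len(ERT))


ALL = ["TR1", "TR2", "TR3"]
for x in ALL:
    print(x, significance([x], x), significance(ALL, x))
```

A subset X can then be declared significant, in the sense of Definition 4.3, when `significance(X, x) > 0` holds for every x in X.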

5 An Algorithm to Discover Knowledge 

Definition 5.1. In $\mathcal{D} = \langle U, C \cup D \rangle$, let Y be a subset of D:
$Y \subseteq D$ such that $Y \neq \emptyset$, $U/Y \neq U/\delta = \{U\}$. Let X be a non-empty
or empty subset $X \subseteq C$ of C. A subset $X_0$ of X is said to be a key of X (for Y)
if $X_0$ satisfies

(1) $S_{X_0}(Y) = S_X(Y)$;

(2) if $X' \subset X_0$ then $S_X(Y) \supset S_{X'}(Y)$.

The empty subset $\emptyset$ has key $\emptyset$ (for Y).

Applying the significance measure, we can design an algorithm to discover
knowledge as follows (Grzymala-Busse, Slowinski et al 1992).

Let $\mathcal{D} = \langle U, C \cup D \rangle$ be a decision table, where C is the condition
attribute set and D is the decision attribute set.

Let X be a non-empty subset of C: $\emptyset \subset X \subseteq C$.

Let Y be a subset of D: $Y \subseteq D$ such that $Y \neq \emptyset$, $U/Y \neq U/\delta = \{U\}$.

Algorithm D. This algorithm finds one key of X (for Y) and discovers
knowledge.
Step 1. Let $X = \{x_1, x_2, \ldots, x_j, \ldots, x_J\}$. Compute
$U/x_1, U/x_2, \ldots, U/x_j, \ldots, U/x_J$; $U/y$ for $y \in Y$ and
$U/Y = \{W_1, W_2, \ldots, W_i, \ldots, W_I\}$.

Compute $U/X$ and $S_X(W_1), S_X(W_2), \ldots, S_X(W_i), \ldots, S_X(W_I)$.

Compute $sig^{Y}(x_j)$ for $j = 1, 2, \ldots, J$.

Choose $x_{j_1}$ such that $|S_{x_{j_1}}(W_i)|$ for some i is a maximum (when there is
more than one possibility it can be run in parallel mode concurrently for all
possibilities).

For every i, j such that $S_{x_j}(W_i) \neq \emptyset$ discover the following rules:

($u \in S_{x_j}(W_i)$) If ($x_j$) $x_j$ is $x_j(u)$ then (y) y is $y(u)$ with strength
$spt_{x_j}(W_i)$.

If $\mathcal{W}_1 = \{W_i \in U/Y \mid S_{x_j}(W_i) \subset S_X(W_i) \text{ for all } j\}$ is
empty or $J = 1$ then the algorithm is completed. Now
$\neg \exists i \forall j\,(S_{x_j}(W_i) \subset S_X(W_i)) = \forall i \exists j\,(S_{x_j}(W_i) = S_X(W_i))$:
for each $W_i \in U/Y$ there is an $x_j$ such that $S_{x_j}(W_i) = S_X(W_i)$. The
collection of these $x_j$'s may be a key of X (for Y).

Otherwise, go to step 2.

Step 2. Compute $U/x_{j_1}x_j$ and $sig^{Y}_{x_{j_1}}(x_j)$ for $j \neq j_1$.

Choose an $x_{j_2}$ such that $sig^{Y}_{x_{j_1}}(x_{j_2})$ is a maximum.

For $S_{x_{j_1}x_{j_2}}(W_i) - S_{x_{j_1}}(W_i) - S_{x_{j_2}}(W_i) \neq \emptyset$, where
$W_i \in \mathcal{W}_1$, we can discover the following rules:
($u \in S_{x_{j_1}x_{j_2}}(W_i) - S_{x_{j_1}}(W_i) - S_{x_{j_2}}(W_i)$)

If ($x_{j_1}$) $x_{j_1}$ is $x_{j_1}(u)$, ($x_{j_2}$) $x_{j_2}$ is $x_{j_2}(u)$, then (y) y is
$y(u)$ with strength $spt_{x_{j_1}x_{j_2}}(W_i)$.

If $\mathcal{W}_2 = \{W_i \in \mathcal{W}_1 \mid S_{x_{j_1}x_{j_2}}(W_i) \subset S_X(W_i)\}$ is
empty or $J = 2$ then the algorithm is completed and $\{x_{j_1}, x_{j_2}\}$ may be a key
of X (for Y).

Otherwise, go to step 3.

Step $|X|$. Compute $U/x_{j_1}x_{j_2} \ldots x_{j_{|X|-1}}x_j$ and
$sig^{Y}_{x_{j_1} \ldots x_{j_{|X|-1}}}(x_j)$ for the remaining attribute $x_j$.

Let $x_{j_{|X|}} = x_j$.

For $S_X(W_i) - S_{X-\{x_{j_{|X|}}\}}(W_i) - S_{x_{j_{|X|}}}(W_i) \neq \emptyset$, where
$W_i \in \mathcal{W}_{|X|-1}$, we can discover the following rules:
($u \in S_X(W_i) - S_{X-\{x_{j_{|X|}}\}}(W_i) - S_{x_{j_{|X|}}}(W_i)$)

If ($x_{j_1}$) $x_{j_1}$ is $x_{j_1}(u)$, ($x_{j_2}$) $x_{j_2}$ is $x_{j_2}(u)$, ...,
($x_{j_{|X|}}$) $x_{j_{|X|}}$ is $x_{j_{|X|}}(u)$ then (y) y is $y(u)$ with strength
$spt_{x_{j_1}x_{j_2} \ldots x_{j_{|X|}}}(W_i)$.

Then the algorithm is completed and
$\{x_{j_1}, x_{j_2}, \ldots, x_{j_{|X|-1}}, x_{j_{|X|}}\}$ may be a key of X (for Y).
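The attribute-selection part of the steps above can be illustrated by a much simplified greedy loop. The Python sketch below is an assumption-laden reduction of Algorithm D, not the algorithm itself: it skips rule generation and the bookkeeping of the families of decision classes still to be resolved, picks at every step the attribute that maximises the total number of supported objects (which for later steps amounts to maximal significance), and breaks ties by the order of the attribute list. The names support and find_key are hypothetical, and the result is only a candidate key, as in the text.

```python
# ERT table of Example 1.1, rows u1..u11: TR1 TR2 TR3 Confidence
# (S = Short, M = Middle, L = Long, A = Any)
ROWS = """S S M High
S M M High
S M L High
S S S Middle
S S L Middle
S M S Middle
M S M Middle
M M M Middle
M M S Low
L A A Low
A L A Low""".splitlines()
ERT = {i + 1: tuple(r.split()) for i, r in enumerate(ROWS)}
COL = {"TR1": 0, "TR2": 1, "TR3": 2, "y": 3}


def partition(attrs):
    """U/X for a sequence of attribute names (a single class when empty)."""
    classes = {}
    for u, row in ERT.items():
        classes.setdefault(tuple(row[COL[a]] for a in attrs), set()).add(u)
    return list(classes.values())


def support(attrs, decision=("y",)):
    """Objects lying in a class of U/X that fits inside one decision class."""
    return {u for W in partition(decision)
            for V in partition(attrs) if V <= W for u in V}


def find_key(X, decision=("y",)):
    """Add attributes greedily until the support of the whole set X is reached."""
    target = support(X, decision)
    chosen = []
    while support(chosen, decision) != target:
        rest = [x for x in X if x not in chosen]
        # attribute giving the largest support when joined to those chosen so far
        best = max(rest, key=lambda x: len(support(chosen + [x], decision)))
        chosen.append(best)
    return chosen


print(find_key(["TR1", "TR2", "TR3"]))
```

For the ERT data all three attributes are needed, which agrees with Example 4.1, where each attribute is significant in the full condition set.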

Example 5.1. By using this algorithm, the decision table of example 1.1
can be reduced to the following:

RULES    | TR1    | TR2    | TR3    | Confidence | Strength
---------+--------+--------+--------+------------+---------
u1, u2   | Short  |        | Middle | High       | 2/11
u3       | Short  | Middle | Long   | High       | 1/11
u4, u6   | Short  |        | Short  | Middle     | 2/11
u5       | Short  | Short  | Long   | Middle     | 1/11
u7, u8   | Middle |        | Middle | Middle     | 2/11
u9       | Middle |        | Short  | Low        | 1/11
u10, u11 |        |        | Any    | Low        | 2/11

Now this is an illustrative example. However, if it is scaled up, it can be
appreciated that condensed rules of thumb can be distilled from large tables of
monitoring data. The final rule in the table is interesting. If we take 'any' to be
the same as blank ("doesn't matter") it says "confidence is low".

6 Summary and Future Work 

In this paper we apply some algorithms with relatively low computation times 
which are helpful in the distillation of rules from data on loose parts based on 
rough sets. The results show how condensed rules which may be useful for safety 
and general control can be derived from large collections of data. 
Abstract. Classification rule induction is a central problem addressed
by machine learning and data mining. Rough set theory is an important
tool for data classification. The traditional rough set approach, however,
pursues fully correct or certain classification rules without considering
other factors such as uncertain class labeling, importance of examples,
as well as the uncertainty of the final rules. A generalized rough sets
model, GRS, is proposed and a classification rule induction approach
based on GRS is suggested. Our approach extends the variable precision
rough sets model and attempts to reduce the influence of noise by
considering the importance of each training example and handling
uncertain class labels. The final classification rules are also measured
with the uncertainty factor.

Keywords: Rough set theory, supervised learning, classification, rule 
induction. 



1 Introduction 

Supervised learning or classification is an important research topic in machine 
learning and data mining [1]. Rough sets theory can be used to induce rules 
from large data sets [5]. It is complementary to statistical methods and provides 
the necessary framework to conduct data analysis and knowledge discovery from 
imprecise and ambiguous data. A number of algorithms and systems for learning 
classifiers have been developed based on this theory [2, 4, 8]. 

The original rough sets approach pursues fully correct and certain
classifications within the available information. Unfortunately, the available
information usually allows only for partial classification. As a result, classification with
a controlled degree of uncertainty, or a classification error rate, is outside the 
realm of rough set theory. The variable precision rough set model (VP-model) 
[7] presents the concept of the majority inclusion relation. Rules which are al- 
most always correct, called strong rules, can be extracted with the VP-model. 
Such strong rules are useful for decision support in a rule-based expert system. 
However, these approaches have limitations. The following are some of them. 

(1) All tuples are treated with equal importance [2, 3]. Usually, the original
data are generalized into a concise form by finding an attribute reduct. The tuples
with the same values of attributes in the reduct are combined together and
a “vote” field is attached for the count of the combined tuples. In real-world
applications, different tuples may have different degrees of importance to the
decision attributes, and thus should contribute differently to the “vote”.

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 322-329, 2001.

(2) All training examples must be crisply labeled; otherwise the final result
will be incorrect and lead to a large classification error. In some actual applications,
however, it is very expensive and/or risky to make a yes-no decision. In such
cases, it could be helpful to associate some uncertainty factor with the class
label.

(3) The lower and upper approximations of a concept are defined based on
the strict set inclusion operation, which has no tolerance for noisy data in
the classification [6]. For example, assume $X = \{x_1, x_2, \ldots, x_{99}, \ldots, x_{500}\}$, and
$E_1 = \{x_1, x_2, \ldots, x_{99}, x_{501}\}$ and $E_2 = \{x_{500}, x_{501}, \ldots, x_{599}\}$ are two equivalence
classes. All examples in $E_1$ are in $X$ except $x_{501}$, and all examples in $E_2$ are not
in $X$ except $x_{500}$. In the traditional model, both $E_1$ and $E_2$ are treated equally
and thus put in the boundary region. However, in practice, $x_{501}$ may be noise in
$E_1$ and $x_{500}$ may be noise in $E_2$. It seems reasonable to put $E_1$ in the positive
region and $E_2$ in the negative region.
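
The example in limitation (3) can be checked numerically. The following is a minimal sketch, not part of the original method: the function names and the tolerance threshold `beta = 0.95` are assumptions chosen only for illustration. It contrasts strict set inclusion with a majority-inclusion test on the sets X, E1, and E2 described above.

```python
# Target concept X and the two equivalence classes from the example in the text.
X = {f"x{i}" for i in range(1, 501)}               # x1 .. x500
E1 = {f"x{i}" for i in range(1, 100)} | {"x501"}   # x1 .. x99 plus x501
E2 = {f"x{i}" for i in range(500, 600)}            # x500 .. x599


def strict_region(E, X):
    """Classical rule: positive only if E lies wholly in X, negative only if disjoint."""
    if E <= X:
        return "positive"
    if E.isdisjoint(X):
        return "negative"
    return "boundary"


def inclusion_ratio(E, X):
    """Fraction of the members of E that belong to X."""
    return len(E & X) / len(E)


def tolerant_region(E, X, beta):
    """Majority-inclusion rule; beta is a hypothetical tolerance threshold."""
    ratio = inclusion_ratio(E, X)
    if ratio > beta:
        return "positive"
    if 1.0 - ratio > beta:
        return "negative"
    return "boundary"


print(strict_region(E1, X), strict_region(E2, X))
print(inclusion_ratio(E1, X), inclusion_ratio(E2, X))
print(tolerant_region(E1, X, 0.95), tolerant_region(E2, X, 0.95))
```

Under strict inclusion a single noisy element is enough to push a class of 100 examples into the boundary region, whereas the ratio-based test tolerates it.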

We propose an approach for learning classification rules based on a new
generalized rough set model, GRS, which extends the concept of the variable
precision rough set model. Our new approach deals with situations where
uncertain objects may exist, different objects may have different degrees of
importance attached, and different classes may have different noise ratios. The
original rough sets model and the VP-model of rough sets [7] become special
cases of this model. The primary advantage of the GRS model is that it extends
the traditional rough sets model to work well in noisy environments.

This paper is organized as follows. A generalized rough set model, GRS, is
suggested in Section 2 to overcome the above limitations. A supervised learning
approach based on GRS is developed in Section 3. In Section 4, an illustration
uses the generalized model to learn classification rules from noisy
data. Finally, Section 5 gives concluding remarks.

2 Generalized Rough Sets Model 

In order to overcome the limitations discussed in the previous section, we propose
a generalized rough set model which is developed from the traditional model and
the VP-model to deal with the importance of tuples and uncertain class labels,
respectively. For simplicity, we only consider binary classification, but the model
can easily be extended to multiple classes.

For our purpose, the information system IS is extended to the uncertain
information system UIS as follows:

$$UIS = \langle U, C, D, \{V_a\}_{a \in C}, f, g, d \rangle,$$

where $U = \{u_1, u_2, \ldots, u_n\}$ is a non-empty set of tuples, $C$ is a non-empty set
of condition attributes, and $D$ is a binary decision attribute with possible values 1
and 0, where 1 represents the positive class and 0 the negative class. $V_a$ is the
domain of attribute $a$, with at least two elements. $f$ is a function $U \times C \to V_a$,
which maps each pair of tuple and attribute to an attribute value. $g$
is a function $U \to [0, 1]$, which maps each tuple to a value between 0 and 1
indicating the certainty of being a positive example. $d$ is a function $U \to [0, 1]$,
which assigns each tuple an importance factor representing how important the
tuple is for the classification task. The importance factor $d$ corresponds to the
condition attribute set $C$, while the class label certainty $g$ corresponds to the
decision attribute $D$. $d \times g$ contributes to the positive class, while $d \times (1 - g)$
contributes to the negative class.

We adapt the concept of relative classification error introduced in [8] to
deal with noisy data. The main idea is to put some boundary examples
into the positive region or the negative region. In this way, the strong rules which
are almost always correct are obtained. In actual applications, the positive class
and the negative class may contain different kinds of noise and have different noise
tolerance degrees. Two classification factors $P_\beta$ and $N_\beta$ ($0.0 < P_\beta, N_\beta < 1.0$) are
introduced to solve this problem; they can be determined by estimating the noise
degree in the positive region and the negative region, respectively.

Let $E$ be a non-empty equivalence class in the approximation space. The
classification ratios of $E$ with respect to the positive class $P_{class}$ and negative
class $N_{class}$ are defined as

$$C_P(E) = \frac{\sum_{x \in E} d(x) \times g(x)}{\sum_{x \in E} d(x)},$$

$$C_N(E) = \frac{\sum_{x \in E} d(x) \times (1 - g(x))}{\sum_{x \in E} d(x)} = 1 - C_P(E),$$

respectively. $C_P(E)$ is the certainty to classify $E$ to the positive region, while
$C_N(E)$ is the certainty to classify $E$ to the negative region. If tuples in $E$ are
classified to the positive class, the classification error rate is $1 - C_P(E)$. If tuples in
$E$ are classified to the negative class, the classification error rate is $1 - C_N(E)$.

For the pre-specified precision thresholds $P_\beta$ and $N_\beta$, $E$ is classified to the
positive class if $C_P(E) > P_\beta$, or to the negative class if $C_N(E) > N_\beta$. Otherwise,
$E$ is put in the boundary region.
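
The classification ratios and the threshold test can be sketched in a few lines. This is an illustrative sketch only: the `Example` record, the function names, and the sample values are assumptions, not data from the paper. A ratio exactly equal to its threshold is sent to the boundary region here, which is one reading of the strict inequalities above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    name: str
    g: float  # certainty of being a positive example, in [0, 1]
    d: float  # importance factor, in [0, 1]


def classification_ratios(members):
    """Return (C_P, C_N) for one equivalence class, weighted by importance d."""
    total = sum(x.d for x in members)
    if total <= 0:
        raise ValueError("the class needs a positive total importance")
    cp = sum(x.d * x.g for x in members) / total
    return cp, 1.0 - cp


def classify(members, p_beta, n_beta):
    """Assign an equivalence class to the positive, negative, or boundary region."""
    cp, cn = classification_ratios(members)
    if cp > p_beta:
        return "positive"
    if cn > n_beta:
        return "negative"
    return "boundary"


# Hypothetical equivalence class with two tuples of unequal importance.
E = [Example("a", g=0.9, d=1.0), Example("b", g=0.6, d=0.5)]
cp, cn = classification_ratios(E)
print(round(cp, 3), round(cn, 3))
print(classify(E, p_beta=0.75, n_beta=0.80))
print(classify(E, p_beta=0.85, n_beta=0.80))
```

Because tuple "a" carries twice the importance of tuple "b", the weighted ratio lies closer to 0.9 than the unweighted mean of the two certainties would.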

The concepts of set approximation in IS can be extended for UIS according
to the classification factors $P_\beta$ and $N_\beta$. Let $R_{P,N}$ be the indiscernibility relation
based on a set of condition attributes $B$ and $R^*_{P,N} = \{E_1, E_2, \ldots, E_n\}$ be the
collection of equivalence classes of $R_{P,N}$. Assume $X \subseteq U$; then the positive
region and the negative region of $X$ with respect to precision $P_\beta$ and
$N_\beta$, denoted $POS_{P_\beta}(X|B)$ and $NEG_{N_\beta}(X|B)$ respectively, are defined as

$$POS_{P_\beta}(X|B) = \bigcup \{E \in R^*_{P,N} : C_P(E) > P_\beta\},$$

$$NEG_{N_\beta}(X|B) = \bigcup \{E \in R^*_{P,N} : C_N(E) > N_\beta\}.$$

Similarly, the boundary region of $X$, $BND_{P_\beta,N_\beta}(X|B)$, is composed of those
elementary sets which are neither in the positive region nor in the negative
region of $X$,

$$BND_{P_\beta,N_\beta}(X|B) = \bigcup \{E \in R^*_{P,N} : C_P(E) < P_\beta, C_N(E) < N_\beta\}.$$
Clearly, the difference of the extended concepts from the original ones is that
the elementary sets are divided into the positive or negative region in terms of their
classification ratio instead of their inclusion in the target concept. $P_\beta$ and $N_\beta$
can be adjusted for different data sets with different noise levels. The positive
region and negative region shrink while the boundary region expands as $P_\beta$ and
$N_\beta$ increase. On the other hand, as $P_\beta$ and $N_\beta$ decrease, the boundary region
shrinks and the positive region and negative region expand.
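
The effect of the thresholds on the three regions can be seen in a small sketch. The rows, the attribute name `a`, and the function names below are hypothetical and chosen only for illustration; a class whose ratio exactly equals a threshold is placed in the boundary region, which is an assumption.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group rows into equivalence classes by their values on attrs."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in attrs)].append(row)
    return dict(classes)


def regions(rows, attrs, p_beta, n_beta):
    """Return the keys of the classes in the positive, negative, and boundary regions."""
    pos, neg, bnd = [], [], []
    for key, members in partition(rows, attrs).items():
        total = sum(r["d"] for r in members)
        cp = sum(r["d"] * r["g"] for r in members) / total
        if cp > p_beta:
            pos.append(key)
        elif 1.0 - cp > n_beta:
            neg.append(key)
        else:
            bnd.append(key)
    return pos, neg, bnd


# Hypothetical uncertain information system with one condition attribute "a".
rows = [
    {"a": 0, "g": 0.95, "d": 1.0},
    {"a": 0, "g": 0.85, "d": 1.0},
    {"a": 1, "g": 0.60, "d": 1.0},
    {"a": 1, "g": 0.50, "d": 1.0},
    {"a": 2, "g": 0.10, "d": 1.0},
    {"a": 2, "g": 0.14, "d": 1.0},
]

print(regions(rows, ["a"], p_beta=0.85, n_beta=0.80))
print(regions(rows, ["a"], p_beta=0.95, n_beta=0.95))
```

With the looser thresholds one class falls in each region; with the stricter thresholds every class moves to the boundary, which matches the behaviour described above.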

Assume $B$ is a subset of the condition attribute set, $B \subseteq C$, and $D$ is the decision
attribute. Let $D^*$ be the family of equivalence classes of the indiscernibility relation
$IND_{P,N}(D)$, $D^* = \{P_{class}, N_{class}\}$. For the given classification factors $P_\beta$ and $N_\beta$,
the dependency degree of the decision attribute $D$ on the condition attribute set
$B$ is defined as

$$\gamma(B, D, P_\beta, N_\beta) = \frac{\sum_{x \in POS_{P_\beta}(P_{class}|B)} d(x) + \sum_{x \in NEG_{N_\beta}(P_{class}|B)} d(x)}{\sum_{x \in U} d(x)}.$$

Simply, the dependency degree $\gamma(B, D, P_\beta, N_\beta)$ of the decision attribute $D$
on the condition attribute set $B$ at precision levels $P_\beta$ and $N_\beta$ is the proportion
of the tuples in $U$ that can be classified into the positive or negative class with an
error rate less than the pre-specified thresholds $(1 - P_\beta)$ and $(1 - N_\beta)$.
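
As a sketch of how the dependency degree could be computed, assuming each tuple is stored as a dictionary with its attribute values plus the keys `g` and `d` (a representation chosen here for illustration, with hypothetical data):

```python
from collections import defaultdict


def dependency_degree(rows, attrs, p_beta, n_beta):
    """Importance-weighted share of tuples that fall in the positive or negative region."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in attrs)].append(row)

    classified = 0.0
    for members in classes.values():
        weight = sum(r["d"] for r in members)
        cp = sum(r["d"] * r["g"] for r in members) / weight
        # The class counts if it reaches either region (strict inequalities assumed).
        if cp > p_beta or (1.0 - cp) > n_beta:
            classified += weight
    return classified / sum(r["d"] for r in rows)


# Hypothetical tuples: the middle class is ambiguous and carries less importance.
rows = [
    {"a": 0, "g": 0.95, "d": 1.0},
    {"a": 0, "g": 0.85, "d": 1.0},
    {"a": 1, "g": 0.60, "d": 0.5},
    {"a": 1, "g": 0.50, "d": 0.5},
    {"a": 2, "g": 0.10, "d": 1.0},
    {"a": 2, "g": 0.14, "d": 1.0},
]

print(dependency_degree(rows, ["a"], p_beta=0.85, n_beta=0.80))
print(dependency_degree(rows, ["a"], p_beta=0.95, n_beta=0.95))
```

In this data the two decisive classes hold 4.0 of the total importance 5.0, so lowering the importance of the ambiguous tuples raises the dependency degree even though the number of boundary tuples is unchanged.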

Finally, the concept of attribute reduct in UIS is proposed. By substituting
the functional dependency degree in the traditional reduct definition with the
dependency degree $\gamma(B, D, P_\beta, N_\beta)$ defined as above, the attribute reduct can
be generalized to allow for a further reduction of attributes.

For the given classification factors $P_\beta$, $N_\beta$:

1. an attribute $a \in B$ is redundant in $B$ if
$\gamma(B - \{a\}, D, P_\beta, N_\beta) = \gamma(B, D, P_\beta, N_\beta)$;
otherwise $a$ is indispensable;

2. if all attributes in $B$ are indispensable, then $B$ is called orthogonal;

3. $B$ is called a reduct of the condition attribute set $C$ in UIS if and only if $B$
is orthogonal and $\gamma(B, D, P_\beta, N_\beta) = \gamma(C, D, P_\beta, N_\beta)$.

Thus, an attribute reduct is such a subset of condition attributes that the 
decision attribute has the same dependency degree on it as that on the entire 
set of condition attributes, and no attribute can be eliminated from it without 
affecting the dependency degree. 

The concept of reduct is very useful in those applications where it is necessary 
to find the most important collection of condition attributes responsible for a 
cause-and-effect relationship and also useful for eliminating noise attributes from 
the information system. Given an information system, there may exist more than 
one reduct. Each reduct can be used as an alternative group of attributes which
could represent the original information system with the classification factors
$P_\beta$ and $N_\beta$. An open question is how to select an optimal reduct. It certainly
depends on the optimality criteria. The computational procedure for verifying a 
single reduct is very straightforward, but finding all reducts is hard. 
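
The contrast between verifying one reduct and finding all of them can be sketched as follows. This is a brute-force illustration under assumed data and names, not an efficient algorithm: `is_reduct` applies the three-part definition above to a single subset, while `all_reducts` simply tries every non-empty subset, which grows exponentially with the number of attributes.

```python
from collections import defaultdict
from itertools import combinations
from math import isclose


def dependency_degree(rows, attrs, p_beta, n_beta):
    """Importance-weighted share of tuples in the positive or negative region."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in attrs)].append(row)
    classified = 0.0
    for members in classes.values():
        weight = sum(r["d"] for r in members)
        cp = sum(r["d"] * r["g"] for r in members) / weight
        if cp > p_beta or (1.0 - cp) > n_beta:
            classified += weight
    return classified / sum(r["d"] for r in rows)


def is_reduct(rows, subset, all_attrs, p_beta, n_beta):
    """Check that subset keeps the dependency degree and has no redundant attribute."""
    def gamma(attrs):
        return dependency_degree(rows, attrs, p_beta, n_beta)

    target = gamma(subset)
    if not isclose(target, gamma(all_attrs)):
        return False
    # Orthogonality: dropping any single attribute must change the degree.
    return all(
        not isclose(gamma([b for b in subset if b != a]), target)
        for a in subset
    )


def all_reducts(rows, all_attrs, p_beta, n_beta):
    """Exhaustive search over every non-empty attribute subset."""
    return [
        subset
        for size in range(1, len(all_attrs) + 1)
        for subset in combinations(all_attrs, size)
        if is_reduct(rows, subset, all_attrs, p_beta, n_beta)
    ]


# Hypothetical data in which c1 carries no information about the class.
rows = [
    {"c1": 0, "c2": 0, "g": 0.95, "d": 1.0},
    {"c1": 0, "c2": 1, "g": 0.60, "d": 1.0},
    {"c1": 0, "c2": 2, "g": 0.10, "d": 1.0},
    {"c1": 1, "c2": 0, "g": 0.90, "d": 1.0},
    {"c1": 1, "c2": 1, "g": 0.50, "d": 1.0},
    {"c1": 1, "c2": 2, "g": 0.05, "d": 1.0},
]

print(all_reducts(rows, ["c1", "c2"], p_beta=0.85, n_beta=0.80))
```

The full set of attributes is rejected here because dropping `c1` leaves the dependency degree unchanged, so only the single-attribute subset containing `c2` qualifies.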



3 Learning Classification Rules

Based on the generalized rough set model, we can design a procedure for learning
classification rules as follows.

Assume an uncertain information system $UIS = \langle U, C, D, \{V_a\}_{a \in C}, f, g, d \rangle$
and classification precision factors $P_\beta$ and $N_\beta$. The rule induction procedure
consists of the following three steps.

Step 1: Compute the dependency degree of the decision attribute $D$ on the
condition attribute set $C$:

• Calculate the discernibility equivalence classes of the training set $U$ with
respect to the condition attribute set $C$. Generally, each distinct tuple
forms an equivalence class. Let $X_C$ be the set of all such equivalence classes;

• For each equivalence class $X \in X_C$, calculate its classification ratios
$C_P(X)$ and $C_N(X)$;

• Calculate the positive and negative regions of $U$ with respect to $C$,
$POS_{P_\beta}(U|C)$ and $NEG_{N_\beta}(U|C)$, in terms of the classification ratios
above;

• Calculate the dependency degree of $D$ on $C$ according to the positive
and negative regions, that is, $\gamma(C, D, P_\beta, N_\beta)$.

Step 2: Find the generalized attribute reducts of the condition attribute set
$C$ according to the attribute dependency:

For each non-empty subset $B$ of $C$, $B \subseteq C$,

• Calculate the discernibility equivalence classes of $U$ with respect to $B$.
Let $X_B$ be the set of all such equivalence classes;

• For each equivalence class $X \in X_B$, calculate its classification ratios
$C_P(X)$ and $C_N(X)$;

• Calculate the positive and negative regions of $U$ with respect to $B$,
$POS_{P_\beta}(U|B)$ and $NEG_{N_\beta}(U|B)$;

• Calculate the dependency degree of $D$ on $B$, that is, $\gamma(B, D, P_\beta, N_\beta)$;

• Compare the dependency degree of $D$ on $C$ with that of $D$ on $B$. If
$\gamma(C, D, P_\beta, N_\beta) = \gamma(B, D, P_\beta, N_\beta)$, then $B$ is an attribute reduct of $C$;
otherwise it is not.

Let $RED(C, D, P_\beta, N_\beta)$ be the set of all generalized attribute reducts of $C$.

Step 3: Construct classification rules with certainty factors. For each
attribute reduct $B \in RED(C, D, P_\beta, N_\beta)$ of $C$, a set of rules can be obtained as
follows, with each rule corresponding to an equivalence class with respect to $B$.

Assume $B$ consists of $m$ attributes, $B = \{B_1, B_2, \ldots, B_m\}$, and $X$ is an
equivalence class with respect to $B$, $X \in X_B$. According to the definition of the
discernibility relation, all tuples in $X$ have the same attribute values for all
attributes in $B$, say $B_1 = b_1, B_2 = b_2, \ldots, B_m = b_m$. A classification rule
for the positive class can be constructed according to $X$.

$Rule_P(X)$: If $B_1 = b_1, B_2 = b_2, \ldots, B_m = b_m$,

then $D$ = positive class ($CF = CF_P(X)$),

where the certainty factor $CF_P(X)$ is defined as $CF_P(X) = \min_{x \in X} g(x)$.

Similarly, a classification rule for the negative class can be obtained as

$Rule_N(X)$: If $B_1 = b_1, B_2 = b_2, \ldots, B_m = b_m$,

then $D$ = negative class ($CF = CF_N(X)$),

where the certainty factor $CF_N(X)$ is defined as $CF_N(X) = \min_{x \in X} (1 - g(x))$.
The dual rules $Rule_P(X)$ and $Rule_N(X)$ have the same condition
part but opposite decision labels with different certainty factors, and the sum
of their certainty factors is not necessarily 1. This shows that believing that an example
belongs to the positive class with certainty $\alpha$ does not mean believing that it
belongs to the negative class with certainty $1 - \alpha$. It can be proved that
$CF(Rule_P(X)) + CF(Rule_N(X)) \le 1$, which means that there may exist a
certainty boundary about which we know nothing.

Thus, for a reduct $B$, we obtain two sets of classification rules, one for the
positive class and the other for the negative class, and each set consists of $card(X_B)$
rules with different certainty factors.
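
Step 3 can be sketched as below. The tuple representation, the attribute name `c2`, and the data are assumptions made for illustration; the certainty factors follow the min-based definitions above. The final loop checks the stated bound on the sum of the two certainty factors, which holds because min g(x) plus min (1 - g(x)) equals 1 minus the spread between the largest and smallest g(x) in the class.

```python
from collections import defaultdict


def build_rules(rows, reduct):
    """One (condition, CF_P, CF_N) triple per equivalence class of the reduct."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in reduct)].append(row)

    rules = []
    for key, members in classes.items():
        condition = dict(zip(reduct, key))
        cf_p = min(r["g"] for r in members)        # certainty of the positive rule
        cf_n = min(1.0 - r["g"] for r in members)  # certainty of the negative rule
        rules.append((condition, cf_p, cf_n))
    return rules


# Hypothetical tuples; only the reduct attribute and the certainty g are needed here.
rows = [
    {"c2": 0, "g": 0.95},
    {"c2": 0, "g": 0.90},
    {"c2": 1, "g": 0.60},
    {"c2": 1, "g": 0.50},
    {"c2": 2, "g": 0.10},
    {"c2": 2, "g": 0.05},
]

rules = build_rules(rows, ["c2"])
for condition, cf_p, cf_n in rules:
    print(condition, "positive CF =", round(cf_p, 2), "negative CF =", round(cf_n, 2))
    assert cf_p + cf_n <= 1.0 + 1e-12
```

The gap between 1 and the sum of the two factors is the part of the certainty that neither rule claims.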

4 An Illustration 

In this section, we consider an example using the GRS model to learn
classification rules. Table 1 illustrates a set of training examples, $U = \{e_i\}$ ($i =
1, 2, \ldots, 6$). The set of condition attributes is $C = \{C_1, C_2\}$ and their domains
are $V_{C_1} = \{0, 1\}$ and $V_{C_2} = \{0, 1, 2\}$, respectively. The binary decision attribute
is $D = \{0, 1\}$. Each tuple in the table is labeled as the positive class 1 with a
certainty (column g), and assigned an importance degree (column d).



Table 1. An example of an uncertain information system 
Assume $P_\beta = 0.85$, $N_\beta = 0.80$. Initially, because all tuples are distinct, each
of them forms an equivalence class with respect to the discernibility relation.
Thus, we have six elementary sets, which are $X_1 = \{e_1\}, X_2 = \{e_2\}, \ldots,$ and
$X_6 = \{e_6\}$. Compute their classification ratios as follows:
Since $C_P(X_1) > C_P(X_4) > P_\beta$ and $C_N(X_5) > C_N(X_6) > N_\beta$, taking
the concept as the entire table (because all tuples in $U$ are labeled as the positive
class, though with an uncertainty factor), we have $POS_{0.85}(U|C) = \{X_1, X_4\}$
and $NEG_{0.80}(U|C) = \{X_5, X_6\}$. The boundary region is composed of the other
examples, $BND_{0.85,0.80}(U|C) = \{X_2, X_3\}$.

If we want the positive and negative regions to be more “pure”, we can
increase $P_\beta$ and $N_\beta$. Suppose $P_\beta = 0.9$, $N_\beta = 0.9$; then we have $C_P(X_1) > P_\beta$
and $C_N(X_6) > N_\beta$. Hence, $POS_{0.90}(U|C) = \{X_1\}$, $NEG_{0.90}(U|C) = \{X_6\}$,
and $BND_{0.90,0.90}(U|C) = \{X_2, X_3, X_4, X_5\}$. The equivalence classes $X_3$ and $X_4$
are no longer good enough to be in the positive region, so they are put in the
boundary, and the positive region and negative region shrink.

We can calculate the degree of dependency between the condition attribute
set $C$ and the decision attribute $D$ with classification factors $P_\beta = 0.85$ and
$N_\beta = 0.80$. According to the above results, the degree of dependency between $C$
and $D$ is $\gamma(C, D, 0.85, 0.80) = 0.73$.

Dropping the condition variable $C_1$ from $C$, we get a subset $C' = \{C_2\}$.
Assume $P_\beta = 0.85$ and $N_\beta = 0.80$ again. The discernibility equivalence classes
are $X_1 = \{e_1, e_4\}$, $X_2 = \{e_2, e_5\}$ and $X_3 = \{e_3, e_6\}$ according to the
condition attribute set $C'$. Compute $C_P$ and $C_N$ for each equivalence class as follows:
$C_P(X_1) = 0.90$, $C_N(X_1) = 0.10$, $C_P(X_2) = 0.57$, $C_N(X_2) = 0.43$, $C_P(X_3) =
0.125$, and $C_N(X_3) = 0.875$.

It is easy to see that $POS_{0.85}(U|C') = \{X_1\}$ and $NEG_{0.80}(U|C') = \{X_3\}$.
Thus, we have $\gamma(C', D, 0.85, 0.80) = 0.73$.

From the above, we know that $\gamma(C', D, 0.85, 0.80) = \gamma(C, D, 0.85, 0.80)$, so $B =
\{C_2\}$ is a reduct of $C$ on $D$.

Therefore, three classification rules can be obtained from the three equivalence
classes $X_1$, $X_2$, and $X_3$, respectively.

Rule 1: If $C_2 = 0$ then $D = 1$ (positive class) with $CF = 0.85$.

Rule 2: If $C_2 = 1$ then $D = 1$ (positive class) with $CF = 0.47$.

Rule 3: If $C_2 = 2$ then $D = 1$ (positive class) with $CF = 0.10$.

Similarly, the three converse classification rules for the negative class can be
obtained.

5 Concluding Remarks 

In this paper we analyzed the limitations of the traditional rough set theory
and extended it to a generalized rough sets model for modeling the classification
process in a noisy environment. We developed an approach for classification
rule induction based on this generalized model. The illustration shows that this
approach works well for dealing with imprecise training examples, especially
uncertain class labeling. This is crucial for cases in which it is expensive
or risky to label the training examples correctly and precisely. We also considered
the importance degrees of training examples. Different training examples
may contribute differently to the classification. In real-world applications,
this helps users specify which examples are important and which are
not. Generally, the typical examples recognized by domain experts are always
important, while examples collected automatically may not be so important.

There are two questions which need to be addressed in our future research.
We only discussed simple binary classification, in which the decision attribute
has only two values, the positive or the negative class. For multiple classes
we must determine how the uncertainty factor is attached to each
label and how the final rules are built. In addition, we assumed that the condition
attributes are all discrete valued. For numerical attributes, discretization
must be performed before the induction procedure starts [3].

Another question is about the calculation of attribute reducts. As mentioned 
in the paper, verifying a reduct is straightforward, but finding all reducts is 
hard. There have been some approaches for attacking this problem in literature 
[2, 4, 5]. The results obtained are significant and encouraging. This will be one 
of our next research tasks. 
Abstract. In today’s fast-paced computerized world, many business organizations are
overwhelmed with the huge amount of fast-growing information. It is becoming
difficult for traditional database systems to manage the data effectively. Knowledge
Discovery in Databases (KDD) and Data Mining became popular in the 1980s as
solutions for this kind of data overload problem. In the past ten years, Rough Sets
theory has been found to be a good mathematical approach for simplifying both the
KDD and Data Mining processes. In this paper, KDD and Data Mining will be
examined from a Rough Sets perspective. Based on the Rough Sets research on KDD
that has been done at the University of Regina, we will describe the attribute-oriented
approach to KDD. We will then describe the linkage between KDD and Rough Sets
techniques and propose to unify KDD and Data Mining within a Rough Sets
framework for better overall research achievement. In the real world, the dirty data
problem is a critical issue that exists in many organizations. In this paper, we will describe
in detail how this KDD with Rough Sets framework can be applied to solve
a real-world dirty data problem.

1. Introduction 

Many businesses in today's world are overwhelmed with the huge amount of fast 
growing information. It is becoming more difficult for traditional systems to manage 
the data effectively. KDD and Data Mining became very popular in the 1980s in 
discovering useful information from data. The Rough Sets Theory is a mathematical 
approach for simplifying KDD. KDD and Data Mining have been adopted as 
solutions for better data management. In the real world, dirty data is a common 
problem existing in many organizations. Data cleaning is an important application in 
the KDD application areas. 

The rest of the paper is organized as follows. In Section 2, we overview Knowledge 
Discovery and Data Mining based on paper [1]. In Section 3, we discuss the research 
on KDD within a Rough Sets approach that has been done at the University of Regina 
as presented in papers [8] and [9]. In Section 4, we propose the idea of unification of 
knowledge discovery and data mining using the Rough Sets approach. In Section 5, 
we describe a real-world KDD application on data cleaning. Finally, in Section 6, we 
summarize the main ideas from the observations of KDD, Data Mining, and Rough 
Sets concepts, as well as present the conclusions. 

2. Definition of Knowledge Discovery and Data Mining 

The definitions for KDD, Data Mining, and the KDD Process ([1], p. 83) are given 
below: 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 330-337, 2001.
© Springer-Verlag Berlin Heidelberg 2001 
“Knowledge Discovery in Databases is the non-trivial process of identifying valid,
novel, potentially useful, and ultimately understandable patterns in data.”

“Data Mining is a step in the KDD process consisting of applying data analysis and
discovery algorithms that, under acceptable computational efficiency limitations,
produce a particular enumeration of patterns over the data.”

“KDD Process is the process of using the database along with any required
selection, preprocessing, subsampling, and transformations of it; to apply data
mining methods (algorithms) to enumerate patterns from it; and to evaluate the
products of data mining to identify the subset of the enumerated patterns
deemed ‘knowledge’.”



3. Overview of Rough Sets and Knowledge Discovery 

The Rough Sets Theory was first introduced by Zdzislaw Pawlak in the 1980s, and this
theory provides a tool to deal with vagueness or uncertainty. A Rough Set is defined
as a pair of sets, the lower approximation and the upper approximation, that approximate
an arbitrary subset of the domain. The lower approximation consists of objects that
are sure to belong to the subset of interest, whereas the upper approximation consists of
objects possibly belonging to the subset [11].
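
The two approximations can be illustrated with a short sketch. The universe, its partition into equivalence classes, and the target subset below are made up for illustration only.

```python
def approximations(equivalence_classes, target):
    """Return the classical (lower, upper) approximations of target."""
    lower, upper = set(), set()
    for block in equivalence_classes:
        if block <= target:   # every object of the block surely belongs to target
            lower |= block
        if block & target:    # at least one object of the block belongs to target
            upper |= block
    return lower, upper


# Hypothetical universe {1, ..., 5} partitioned into three equivalence classes.
classes = [{1, 2}, {3, 4}, {5}]
target = {1, 2, 3}

lower, upper = approximations(classes, target)
print(sorted(lower))
print(sorted(upper))
```

Object 3 is in the target but shares its class with object 4, so the class {3, 4} appears only in the upper approximation; the difference between the two approximations is the region of uncertainty.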

In the past decade Rough Sets theory has become very popular in the research field of
Knowledge Discovery and Data Mining. Rough Sets theory has been considered a
good mathematical approach for simplifying the KDD process. The Rough Sets
model has been used in various KDD research areas such as marketing research [9],
industrial control [9], medicine, drug research [11], stock market analysis and others.

In paper [10], Dr. Ziarko points out that the key idea in the Rough Sets approach is
that the imprecise representation of data helps uncover data regularities. Knowledge
discovered from data may often be incomplete or imprecise [10]. By using the Rough
Sets model, data can be classified more easily using the lower approximation and
upper approximation of a set concept [7]. Thus, the approach can be used for inducing the best
description of the subset of interest, representing uncertain knowledge, identifying
data dependencies, and using the discovered patterns to make inferences [7].

Various research on KDD within a Rough Sets approach has been done at the 
University of Regina. Paper [8] describes the use of Rough Sets as a tool for 
Knowledge Discovery. The key approach to KDD in this paper is the use of the 
attribute-oriented Rough Sets method. This method is based on the generalization of 
the information, which includes the examination of the data at different abstraction 
levels, followed by discovering, analyzing, and simplifying the relationships for the 
data. The Rough Sets method of knowledge reduction is applied to eliminate the
irrelevant attributes in the database. During the data analysis and reduction stages,
Rough Sets techniques help to analyze the dependencies of the attributes and help to
identify the irrelevant attributes during the information reduction process. Thus, the
Rough Sets approach helps to generate a minimum subset of the generalized attributes
and helps to derive rules about the
data dependencies within the database from the generalized information system.

Rough Sets theory has been applied in various application areas in the Knowledge 
Discovery area. Some Rough Sets based software packages such as Datalogic have 
also appeared on the market [9]. In the real world, a number of systems have been 
developed and implemented for Data Mining purposes. DBMiner [3] is a real-world 
data mining and data warehouse system. A model of Rough Sets Data Miner 
(RSDM) implementation has also been proposed in paper [2] to unify the framework 
between the Data Mining and Rough Sets techniques. 

4. Unification of Knowledge Discovery and Data Mining using 
Rough Sets Approach 

Data Mining is an important step in the KDD process. In paper [1], the authors 
discuss the challenges of the KDD research field and conclude that there would be an 
advantage to unify the KDD process and Data Mining. The authors take a first step 
towards unification by providing a clear overview of KDD and Data Mining. Data 
Mining methods and the components of the Data Mining algorithms are also 
described. Following their concept, we would like to take the next step towards the
unification of KDD and Data Mining. Our next step is to apply a Rough Sets
approach to a KDD application. The attribute-oriented
Rough Sets approach helps to simplify the data reduction process as well as to derive
rules about data dependencies in the database. Applications described in paper [9]
have also shown that Rough Sets theory can be used for identifying data regularities
in decision tables. RSDM and the Datalogic software tool are two examples of this
unification of KDD and Data Mining within a Rough Sets framework. Unification of
these approaches has a great potential for the overall research achievement in KDD, 
Data Mining, and Rough Sets research areas. In the next section, we use a real-world 
KDD application in the area of data cleaning to discuss in more detail a specific KDD 
application using a Rough Set approach. 

5. KDD, Data Mining with Rough Sets Approach in the Real-World 
Application 

5.1 Background of XYZ and ABC & Overview of the ABC Computer System 

XYZ is a global consulting firm with its headquarters in Honolulu, Hawaii. XYZ
develops pension software systems for its clients. ABC
is XYZ’s largest client. ABC is a shoe company with its headquarters in Wailuku,
Maui. ABC consists of two main divisions: Administration and Benefits.

Prior to 1996, the ABC Administration and Benefits divisions used two IBM
mainframe systems running independently for their daily operations. At that time, the
Benefits system only supported the Health and Welfare administration and a payment
system for the Health and Welfare payment. The pension administration and
short-term disability administration used manual processing. In late 1996, ABC requested
XYZ to move their existing mainframe-based systems to a new consolidated system
under the PC Windows 95 environment using Visual Basic as the development tool
and Microsoft SQL Server as the database system. The steps for restructuring the ABC
systems are as follows:

(1) Convert the Administration system and the Benefits Health and Welfare
systems from IBM mainframe to Visual Basic.

(2) Enhance the payment system, develop the Pension subsystem, Payee
subsystem, and Weekly Indemnity (WI) subsystem, and consolidate them
with the Health and Welfare subsystem as the new ABC Benefits system.

(3) Merge the Administration and Benefits systems as one consolidated system.

5.2 Description of the ABC Data problems 

During the development and implementation of the ABC systems, various data
problems arose. The first problem came from the data conversion from the IBM
mainframe system to the Microsoft SQL Server database system, due to inconsistent
data in the old systems. The second ABC data problem is due to a lack of accurate
information. For example, a temporary social security number (e.g. a member’s last
name) will often be used for a new member when the real SSN is not available. Other
ABC data problems come from users’ typing errors, as well as data errors generated
by software bugs in the new systems.

The Personal table is the most important table in the ABC Administration and 
Benefits database. The Personal table is the parent table of many child tables such as 
Hw_Membership (Health and Welfare Membership) table. When there are duplicated 
personal records in the Personal table, this might cause two duplicated 
Hw_Membership records for this member, as well as more data duplication errors for 
this member’s dental and medical claim payments. Thus, fixing the data in the 
Personal table is the most important starting point for cleaning the ABC data. 

5.3 Rough Sets Approach for Solving the ABC data problems 

As mentioned above, data cleaning for the Personal table is the key data cleaning
issue for the ABC data problem. The Personal table has over 60 attributes and
contains over 20,000 records. It is a time-consuming task to check through every
single record. The patterns for duplicated personal records are also uncertain. Rough
Sets theory is a good approach for dealing with this Personal data duplication
uncertainty. By adopting the attribute-oriented Rough Sets approach [4, 8], the most
relevant attributes will be key in deciding the features of possible duplicated personal
records. For example, the SSN, last_name, and birth_date are important attributes in
determining the uniqueness of a person; however, the cell_phone_number attribute is
considered an irrelevant attribute in determining the uniqueness of a person. Thus, we
eliminate the irrelevant Personal attributes and only keep the relevant Personal
attributes for further examination.
Using Rough Sets theory to analyze the ABC data problem, an information table
is constructed as below. The SSN, last_name, first_name, birth_date, gender, and address
are treated as the condition attributes. The abbreviation “s” indicates that two or more
records contain the same value (e.g. the same last name). The abbreviation “d”
indicates that two or more records contain different values (e.g. different birth dates).
The decision attribute “duplicated?” with y (yes) indicates that the examined personal
records refer to the same person, and n (no) indicates different persons.
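
The "s"/"d" encoding can be sketched as a pairwise comparison of two Personal records. The records below are invented for illustration, and `compare_records` is a hypothetical helper name, not part of the ABC system; only the attribute names are taken from the text.

```python
CONDITION_ATTRIBUTES = (
    "SSN", "last_name", "first_name", "birth_date", "gender", "address",
)


def compare_records(record_a, record_b):
    """Encode each condition attribute as 's' (same value) or 'd' (different value)."""
    return {
        attr: "s" if record_a[attr] == record_b[attr] else "d"
        for attr in CONDITION_ATTRIBUTES
    }


# Two invented personal records that differ only in the first name.
rec_1 = {
    "SSN": "000-00-0001", "last_name": "Doe", "first_name": "Jon",
    "birth_date": "1950-01-01", "gender": "M", "address": "1 Main St",
}
rec_2 = {
    "SSN": "000-00-0001", "last_name": "Doe", "first_name": "John",
    "birth_date": "1950-01-01", "gender": "M", "address": "1 Main St",
}

row = compare_records(rec_1, rec_2)
print(row)
```

Each compared pair yields one row of condition values for the information table; the decision value y or n still has to be supplied by someone who knows whether the two records refer to the same person.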



Table 1: Information Table 
To summarize, the indiscernibility classes for the information table with respect to the
condition attributes SSN, last_name, first_name, birth_date, gender, address are {e1},
{e2}, {e3, e4, e5}, {e6}, {e7, e8}. The indiscernibility classes for the information
table with the attribute address deleted are also {e1}, {e2}, {e3, e4, e5},
{e6}, {e7, e8}. When the indiscernibility classes are the same for a set of condition
attributes and a superset of those condition attributes, then any attribute which is a
member of the superset, but not the subset, is redundant.
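
The redundancy test described above can be sketched as follows. The small table is hypothetical (it is not Table 1, and it uses only three of the condition attributes), and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def indiscernibility_classes(table, attrs):
    """Group example names that share the same values on all attributes in attrs."""
    groups = defaultdict(set)
    for name, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(name)
    return {frozenset(group) for group in groups.values()}


def is_redundant(table, attrs, attr):
    """An attribute is redundant if deleting it leaves the classes unchanged."""
    reduced = [a for a in attrs if a != attr]
    return indiscernibility_classes(table, attrs) == indiscernibility_classes(table, reduced)


# Hypothetical information table using the s/d encoding from the text.
table = {
    "r1": {"SSN": "s", "last_name": "s", "address": "s"},
    "r2": {"SSN": "d", "last_name": "s", "address": "s"},
    "r3": {"SSN": "d", "last_name": "s", "address": "s"},
    "r4": {"SSN": "d", "last_name": "d", "address": "s"},
}
attrs = ["SSN", "last_name", "address"]

print(sorted(sorted(c) for c in indiscernibility_classes(table, attrs)))
print(is_redundant(table, attrs, "address"))
print(is_redundant(table, attrs, "SSN"))
```

In this table every example has the same address value, so removing address changes nothing, whereas removing SSN merges r1 with r2 and r3.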

Inconsistent information occurs in the above information table. For example, e3, e4,
and e5 have the same values for condition attribute address, but different outcomes:
e3 and e5 have a No value for the decision and e4 has a Yes value. Also, we see from
the table that e7 and e8 belong to the same indiscernibility class, but differ in the
value of their decision attribute. All the examples in a given indiscernibility class
have the same values for their condition attributes. The reasons for the
indiscernibility classes {e3, e4, e5} and {e7, e8}, that is, the reasons why the
individuals within each of these classes are indiscernible, are explained below:

(1) The reason for e3 is that the data from the old system were stored incorrectly for a
deceased member and his widow. Due to lack of personal information about the
deceased member, the widow’s SSN and birth date were mistakenly stored for the
deceased member. The reason for e4 is that two personal records for the same person
came from two different subsystems. One subsystem contains the wrong first name
and gender information for the person. The reason for e5 is that the data from the old
system mixed up both the father’s and daughter’s personal information when both
father and daughter are ABC members.

(2) The reason for e7 is also that two personal records for the same person came
from two different subsystems. One subsystem has the wrong first name and birth
date entered for the person. However, the reason for e8 is that two persons happen to
have the same last name and the same temporary SSN stored as last name in the
system.

The following rules can be deduced based on the information table. 

Certain rules (generated from examples e1, e2, and e6):

(SSN, d) ^ (last_name, s) ^ (first_name, s) ^ (birth_date, s) ^ (gender, s) => (Duplicated, y)

(SSN, d) ^ (last_name, d) ^ (first_name, s) ^ (birth_date, d) ^ (gender, d) => (Duplicated, n)

(SSN, s) ^ (last_name, s) ^ (first_name, d) ^ (birth_date, s) ^ (gender, s) => (Duplicated, y)

Possible rules:

(1) (SSN, s) ^ (last_name, s) ^ (first_name, d) ^ (birth_date, s) ^ (gender, d) => (Duplicated, y)

(2) (SSN, s) ^ (last_name, s) ^ (first_name, d) ^ (birth_date, s) ^ (gender, d) => (Duplicated, n)

(3) (SSN, s) ^ (last_name, s) ^ (first_name, d) ^ (birth_date, d) ^ (gender, s) => (Duplicated, y)

(4) (SSN, s) ^ (last_name, s) ^ (first_name, d) ^ (birth_date, d) ^ (gender, s) => (Duplicated, n)

Rules (1) and (2) are deduced from examples e3, e4, and e5. Rules (3) and (4) are
deduced from examples e7 and e8. Following the method suggested in [6], the
strength of an uncertain rule is:

    strength = (# of positive examples covered by the rule)
               / (# of examples covered by the rule, both positive and negative)

Based on this definition, rule 1 has a probability of 1/3. Example e4 is a positive
example covered by this rule. Examples e3 and e5 are negative examples covered by
this rule. Similarly, the probability for rule 2 is 2/3, and for both rules 3 and 4 it is 1/2.
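
The strength calculation can be checked with a few lines of Python. This is only a minimal sketch: the function name `rule_strength` and the set-based representation are assumptions made for illustration, while the example labels and their decisions are taken from the discussion of e3 to e8 above.

```python
from fractions import Fraction


def rule_strength(covered, positive):
    """Return (# positive examples covered) / (# examples covered)."""
    covered = set(covered)
    if not covered:
        raise ValueError("the rule covers no examples")
    return Fraction(len(covered & set(positive)), len(covered))


# Examples covered by possible rules (1)/(2) and (3)/(4) in the text.
covered_12 = {"e3", "e4", "e5"}
covered_34 = {"e7", "e8"}

# Examples whose decision is y (true duplicates) according to the text.
duplicates = {"e4", "e7"}

rule1 = rule_strength(covered_12, duplicates)               # (Duplicated, y)
rule2 = rule_strength(covered_12, covered_12 - duplicates)  # (Duplicated, n)
rule3 = rule_strength(covered_34, duplicates)               # (Duplicated, y)
rule4 = rule_strength(covered_34, covered_34 - duplicates)  # (Duplicated, n)

print(rule1, rule2, rule3, rule4)
```

Exact fractions are used here so that 1/3 and 2/3 are not blurred by rounding.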

5.4 Algorithm 

Our algorithm was inspired by the arguments in favor of an equational theory [5] to
allow equivalence of individuals in the domain to be determined rather than simply
value or string equivalence (i.e., we want to compare actual persons rather than
comparing attributes of persons). A KDD approach to data cleaning is taken in work
[5], where the data is processed multiple times with different keys on each pass to
assist in identifying individuals independent of their attribute values. Separate rules
are used in algorithm [5] for name comparison, SSN comparison, and address
comparison. The results of each comparison are accumulated with the results of the
next rule to be applied, but all of the rules are precise rules.

In contrast, we have combined uncertain rules within the general framework of 
algorithm [5] to solve the duplicated ABC Personal records problem. Algorithm [5] 
refers to an experimentally generated Personnel and Claims multi-database the fields 
of which bear much similarity to those of our ABC Personal records database. The 
rules that we have provided in Section 6.4.1 facilitate decision making when cleaning 
the duplicated personal records. Our algorithm is provided below: 

While scanning through the whole table, for all possible duplicated personal records:

/* Part I: For certain rules */




336 J. Johnson, M. Liu, and H. Chen 



Case a: (SSN, d) ^ (last_name, s) ^ (first_name, s) ^ (birth_date, s) ^ (gender, s)
    Merge the Personal records
    Purge the duplicated Personal records

Case b: (SSN, s) ^ (last_name, s) ^ (first_name, d) ^ (birth_date, s) ^ (gender, s)
    Merge the Personal records
    Purge the duplicated Personal records

/* Part II: For possible rules */

Case c: (SSN, s) ^ (last_name, s) ^ (first_name, d) ^ (birth_date, s) ^ (gender, d)
    Insert the record into the exception report

Case d: (SSN, s) ^ (last_name, s) ^ (first_name, d) ^ (birth_date, d) ^ (gender, s)
    Insert the record into the exception report

EndWhile

Those personal records that satisfy certain rules are duplicated records, and thus will
be merged. Those personal records that satisfy possible rules will be written to an
exception report. The exception report will be sent to the users for further detailed
checking. For those records in the exception report, there is a 33% chance that the
records are duplicated in the Case c situation, and a 50% chance in the
Case d situation.
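
The case analysis of the algorithm can be sketched as a lookup on the same/different pattern of a pair of records. The sketch below is illustrative only: the field names, the sample records, and the `classify` helper are hypothetical, and the paper does not describe how candidate pairs are selected or how records are merged.

```python
from fractions import Fraction

# Attribute order used for the "s"/"d" pattern (field names are assumed).
ATTRS = ("ssn", "last_name", "first_name", "birth_date", "gender")

# Certain rules with decision y: merge and purge (Cases a and b).
MERGE_PATTERNS = {
    ("d", "s", "s", "s", "s"),  # Case a
    ("s", "s", "d", "s", "s"),  # Case b
}

# Possible rules: report, with the chance of a true duplicate (Cases c and d).
EXCEPTION_PATTERNS = {
    ("s", "s", "d", "s", "d"): Fraction(1, 3),  # Case c
    ("s", "s", "d", "d", "s"): Fraction(1, 2),  # Case d
}


def pattern(rec_a, rec_b):
    """Compare two records attribute by attribute: 's' same, 'd' different."""
    return tuple("s" if rec_a[a] == rec_b[a] else "d" for a in ATTRS)


def classify(rec_a, rec_b):
    """Return (action, chance) where action is merge, exception, or keep."""
    p = pattern(rec_a, rec_b)
    if p in MERGE_PATTERNS:
        return ("merge", None)
    if p in EXCEPTION_PATTERNS:
        return ("exception", EXCEPTION_PATTERNS[p])
    return ("keep", None)


# Hypothetical records, invented purely for illustration.
base = {"ssn": "111", "last_name": "Smith", "first_name": "Ann",
        "birth_date": "1950-01-01", "gender": "F"}
other_ssn = dict(base, ssn="222")
other_name_gender = dict(base, first_name="Al", gender="M")
stranger = {"ssn": "999", "last_name": "Jones", "first_name": "Bo",
            "birth_date": "1970-05-05", "gender": "M"}

print(classify(base, other_ssn))
print(classify(base, other_name_gender))
print(classify(base, stranger))
```

Pairs that match no listed pattern are simply left alone in this sketch, which is one possible reading of the algorithm rather than something the text states.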

User intervention is required for the data to which the uncertain rules apply. Users 
usually have to review the exception report and make a decision about data that may 
or may not be duplicated. However, in many situations, even users find it hard to 
distinguish whether or not the personal records are duplicated. In addition, the 
software often encounters bugs that prevent further normal operation of the system 
because of the dirty data. Data cleaning is a necessary task in order for the system to 
continue running smoothly. On the other hand, data recovery is a much more time- 
consuming and costly task. If the data are deleted by mistake, it will be much more 
difficult to recover the data than to clean the data. The strength of a rule is
useful in helping both the users and the consultants weigh the pros
and cons of cleaning the data. The strengths of rules 1, 2, 3, and 4 are
33%, 66%, 50%, and 50% respectively, so the precedence by strength is Rule 2 >
Rule 3 = Rule 4 > Rule 1. Thus, when proceeding with data cleaning to fix the
software bugs, we can decide to apply rule 2 first. If this fixes the software
bug, we do not need to apply rules 1, 3, and 4. If the problem still cannot be solved,
we then apply rule 3 or 4. In the worst case, if the problem still exists after
applying rule 3 or 4, then rule 1 is applied. In conclusion, data cleaning using the
Rough Sets approach helps clean up the data more accurately and helps users
make better decisions. Overall data quality improves at lower cost.

6. Summary and Conclusions 

Researchers often do not distinguish between KDD and Data Mining. At the
University of Regina, as we have seen in Section 3, research is being conducted that
applies Rough Sets to KDD. Data Mining is a step within the KDD process for
finding patterns in a reduced information table. It has been observed in paper [1] that
it would be beneficial to clarify the relationship between Data Mining and KDD.
Although the Data Mining step of searching for patterns of interest in the data has
received the most attention in the literature, in this paper we addressed additional
steps in the KDD process. We have provided a common framework based on a Rough
Sets approach for understanding the relationship between KDD and Data Mining. A
real-world KDD application to data cleaning was described in detail using a Rough
Sets approach. Our conclusions are as follows:

• From the literature [1], we conclude that Knowledge Discovery in Databases and
Data Mining are not the same thing.

• From the research done at the University of Regina, we have observed that KDD
within a Rough Sets approach has great advantages for simplifying the KDD
process.

• Unifying KDD and Data Mining within a Rough Sets approach will benefit
knowledge discovery research overall.

• KDD within a Rough Sets approach has advantages for a real-world organization's
data cleaning problem.

Besides these advantages of the Rough Sets approach in the areas of KDD and Data
Mining, there is still a lot of potential usage of the Rough Sets approach to be
discovered in the KDD and Data Mining areas. We are sure that more and more
researchers have realized the importance of unifying KDD, Data Mining, and Rough
Sets.

Abstract. This paper discusses the implementation of the Distance
Learning Algorithm (DLA), which is derived from the Rough Set Based
Inductive Learning Algorithm proposed by Wong and Ziarko in 1986.
Rough Set Based Inductive Learning uses Rough Set theory to find general
decision rules. Because this algorithm was not designed for distance
learning, it was modified into the DLA to suit the distance learning
requirements. In this paper, we discuss implementation issues.



1 Introduction 

As distance education over the World Wide Web (WWW) becomes more and 
more popular, the lack of contact and feedback between online students and the 
instructor, inherent to the course delivery mode, becomes a growing concern. 
For this reason, the Distance Learning Algorithm (DLA) has been proposed to 
overcome some of the problems involved [2]. DLA is derived from the Rough Set 
Based Inductive Learning Algorithm [4]. Inductive Learning is a research area in 
Artificial Intelligence. It has been used to model the knowledge of human experts 
by using a carefully chosen sample of expert decisions to infer decision rules. 
Rough Set Based Inductive Learning uses Rough Set theory to compute general 
decision rules. Because the Rough Set-Based Inductive Learning algorithm was 
not designed for distance learning, we have modified it into DLA to fit distance 
learning situations. Furthermore, we have implemented it using Java to make it 
more portable in a distance delivery environment. 

In this paper, we discuss the implementation of the Distance Learning Al- 
gorithm. We illustrate how a Decision Tree can be used to find the reduced in- 
formation table. This paper is organized as follows. Section 2 gives an overview 
of distance education. Section 3 introduces Rough Sets and Inductive Learning. 
Section 4 describes the Distance Learning Algorithm. Section 5 discusses the 
implementation of DLA. Section 6 concludes the paper and points out future 
work. 

2 Overview of Distance Education 

Distance Education is a form of education that is perhaps easiest to classify by
describing it as the sum of delivery methodologies that have moved away from
the traditional classroom environment. From the simple correspondence course,
broadcast TV with reverse audio, to specialized video-conferencing tools, such as
Proshare or Flashback, and web-based courses [7], distance-based education has
helped many people obtain college credits, complete training, or update their
knowledge to adapt to the new information society. Distance Education differs from
traditional education in many ways. It offers educational programs to facilitate
a learning strategy without day-to-day contact teaching. It provides interactive
study materials and decentralized learning facilities where students seek
academic and other forms of educational assistance when they need it. With the
recent popularity of the WWW, web-based distance education has become more
and more popular due to its convenience, low cost, suitability to delivery at a
distance, and flexible time scheduling features.



3 Rough Sets and Inductive Learning 

Rough Set theory was introduced by Zdzislaw Pawlak in the early 1980s [?] as
a mathematical tool for approximate modeling of classification problems. It is
based on an information table that is used to represent those parts of reality that
constitute our domain of interest. Given an information table of examples U with
distinct attributes, Rough Set theory allows us to classify the information table in
two different ways: by the condition attributes and by a decision attribute in the
information table to find equivalence classes called indiscernability classes
$U = \{X_1, \ldots, X_n\}$. Objects within a given indiscernability class are indistinguishable
from each other on the basis of those attribute values. Each equivalence class
based on the decision attribute defines a concept. We use $Des(X_i)$ to denote the
description, i.e., the set of attribute values, of the equivalence class $X_i$.

Rough Set theory allows a concept to be described in terms of a pair of sets, 
lower approximation and upper approximation of the class. 

Let $Y$ be a concept. The lower approximation $\underline{Y}$ and the upper approximation
$\overline{Y}$ of $Y$ are defined as

$$\underline{Y} = \{e \in U \mid e \in X_i \text{ and } X_i \subseteq Y\}$$
$$\overline{Y} = \{e \in U \mid e \in X_i \text{ and } X_i \cap Y \neq \emptyset\}$$

In other words, the lower approximation is the union of all those equivalence
classes that are contained in $Y$ and the upper approximation is the union
of all those equivalence classes that have a non-empty intersection with $Y$.

If an element is in $\overline{Y} - \underline{Y}$, we cannot be certain if it is in $Y$. Therefore, the
notion of a discriminant index of $Y$ has been introduced to measure the degree
of certainty in determining whether or not elements in $U$ are members of $Y$ [5].
The discriminant index of $Y$ is defined as follows:

$$\alpha(Y) = 1 - \frac{|\overline{Y} - \underline{Y}|}{|U|}$$
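
The approximations and the discriminant index translate directly into set operations. The following is a minimal sketch on an invented five-object table; the object names, the single attribute `a`, and the helper names are assumptions made for illustration only.

```python
from collections import defaultdict


def partition(table, attrs):
    """Group objects by their values on attrs (indiscernibility classes)."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def approximations(classes, concept):
    """Return (lower, upper) approximation of concept w.r.t. the classes."""
    lower, upper = set(), set()
    for cls in classes:
        if cls <= concept:   # class entirely inside the concept
            lower |= cls
        if cls & concept:    # class overlaps the concept
            upper |= cls
    return lower, upper


def discriminant_index(classes, concept, universe):
    """1 - |upper - lower| / |U|."""
    lower, upper = approximations(classes, concept)
    return 1 - len(upper - lower) / len(universe)


# Toy information table (hypothetical data).
table = {
    "x1": {"a": 0}, "x2": {"a": 0},
    "x3": {"a": 1}, "x4": {"a": 1},
    "x5": {"a": 2},
}
concept = {"x1", "x2", "x3"}

classes = partition(table, ["a"])
lower, upper = approximations(classes, concept)
alpha = discriminant_index(classes, concept, table.keys())
print(sorted(lower), sorted(upper), round(alpha, 3))
```

Here the class {x3, x4} straddles the concept, so its two objects form the boundary region and lower the index to 1 - 2/5.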

For the information table, not all condition attributes are necessary to classify
the information table. There exists a minimal subset of condition attributes that
suffices for the classification. The process used to obtain such a subset is called
knowledge reduction [3], which is the essential part of Rough Set theory. The
minimal set of such essential condition attributes is called the reduct of the
information table.

To describe this more clearly, let us first define a concept of a positive region of 
a classification with respect to another classification. For mathematical reasons, 
we use equivalence relations here instead of classifications. 

Let $R$ be an equivalence relation over $U$. We use $U/R$ to denote the family
of all equivalence classes of $R$. Let $\mathcal{R}$ be a family of equivalence relations over
$U$. Then $\bigcap \mathcal{R}$, the intersection of all equivalence relations belonging to $\mathcal{R}$, is also
an equivalence relation and is denoted by $IND(\mathcal{R})$. Then $U/IND(\mathcal{R})$ is the
family of all equivalence classes of the equivalence relation $IND(\mathcal{R})$. Let $R$ be
an equivalence relation in $\mathcal{R}$. Then $R$ is indispensable if
$U/IND(\mathcal{R} - \{R\}) \neq U/IND(\mathcal{R})$. If every $R \in \mathcal{R}$ is indispensable, then
$U/IND(\mathcal{R})$ is a reduct of $U$.

Let $Q$ be the equivalence relation over $U$ and $P$ a concept, which is an
equivalence class based on the decision attribute. The $P$-positive region of $Q$,
denoted $POS_P(Q)$, is defined as follows:

$$POS_P(Q) = \bigcup_{X \in U/Q} \underline{P}X$$

In other words, the $P$-positive region of $Q$ is the set of all objects of the
universe $U$ which can be classified to classes of $U/Q$ employing knowledge expressed
by the concept $P$.

Let $\mathcal{P}$ be a family of equivalence relations over $U$ and $Q$ an equivalence
relation. Then an equivalence relation $R \in \mathcal{P}$ is $Q$-dispensable in $\mathcal{P}$ if

$$POS_{IND(\mathcal{P})}(Q) = POS_{IND(\mathcal{P} - \{R\})}(Q)$$

Otherwise, $R$ is $Q$-indispensable in $\mathcal{P}$.

If every $R$ in $\mathcal{P}$ is $Q$-indispensable in $\mathcal{P}$, we say that $\mathcal{P}$ is $Q$-independent (or
$\mathcal{P}$ is independent with respect to $Q$).

The family $\mathcal{S} \subseteq \mathcal{P}$ is called a $Q$-reduct of $\mathcal{P}$ if and only if $\mathcal{S}$ is the
$Q$-independent subfamily of $\mathcal{P}$, and $POS_{\mathcal{S}}(Q) = POS_{\mathcal{P}}(Q)$.
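
To make the positive region and Q-dispensability concrete, the sketch below represents each equivalence relation by the attribute that induces it. The four-object table and the helper names are hypothetical and serve only to illustrate the definitions.

```python
from collections import defaultdict


def positive_region(table, cond_attrs, dec_attr):
    """Objects whose condition class falls inside a single decision class."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in cond_attrs)].add(obj)
    pos = set()
    for members in groups.values():
        decisions = {table[obj][dec_attr] for obj in members}
        if len(decisions) == 1:  # the class is consistent with the decision
            pos |= members
    return pos


def is_dispensable(table, cond_attrs, attr, dec_attr):
    """attr is dispensable if dropping it leaves the positive region unchanged."""
    rest = [a for a in cond_attrs if a != attr]
    return (positive_region(table, rest, dec_attr)
            == positive_region(table, cond_attrs, dec_attr))


# Toy table (hypothetical): the decision d depends on a only.
table = {
    "x1": {"a": 0, "b": 0, "d": "n"},
    "x2": {"a": 0, "b": 1, "d": "n"},
    "x3": {"a": 1, "b": 0, "d": "y"},
    "x4": {"a": 1, "b": 1, "d": "y"},
}

print(is_dispensable(table, ["a", "b"], "b", "d"))
print(is_dispensable(table, ["a", "b"], "a", "d"))
```

In this toy table attribute b can be dropped without shrinking the positive region, whereas dropping a empties it, so {a} plays the role of the reduct.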

Inductive Learning, a research area in Artificial Intelligence, is used to model 
the knowledge of human experts by using a carefully chosen sample of expert 
decisions and by inferring decision rules automatically, independent of the sub- 
ject of interest. Rough set based Inductive Learning uses Rough Set theory to 
compute general decision rules [4, 6]. Their closeness determines the relationship 
between the set of attributes and the concept. 

4 Distance Learning Algorithm 

The Inductive Learning Algorithm proposed by Wong and Ziarko [4] was not
designed for a distance learning situation for the following reasons. First, it
only outputs deterministic rules at the intermediate level. For distance education,
nondeterministic rules at the intermediate step can inform online students
about important information useful in guiding their distance learning. Second,
for distance education, we are primarily interested in one concept $Y$, such that
$Des(Y) = \{Fail\}$, i.e., the failure concept. We want to find out what causes
online students to fail. In contrast, the Inductive Learning Algorithm covers
multiple concepts. Thus, we have adapted the Inductive Learning Algorithm
to a distance education environment and the result is the DLA [2]. Unlike the
Inductive Learning Algorithm, DLA calculates the reduct as well.

The DLA algorithm has four steps. Step 1 computes the reduct. Step 2 initial- 
izes variables. As long as the number of condition attributes is not zero, Step 
3, the while loop, is executed. Inside the while loop, the discriminant indices of 
condition attributes are computed first, then the highest index value is stored 
into result and this condition attribute is removed from the condition attributes. 
Next, the algorithm outputs the deterministic or non-deterministic decision rules 
and determines a new domain. If the new domain is empty, the algorithm ex- 
ecutes Step 4 that outputs all the deterministic or non-deterministic decision 
rules [2]. 

5 Implementation of the Distance Learning Algorithm 

Before discussing the implementation, we first introduce the problem. There is a 
text file that contains students’ records including assignments, quizzes, and one 
final examination. Failure in the final examination means failure in the course. 
Therefore, the main issue here is that we want to find out what material students 
failed to understand as shown by their results in assignments and quizzes. The 
assumption is that the lack of understanding of key areas of the course material 
results in failure on the final exam. Thus, we use a table of results from course 
work to determine the rules associated with failure on the final exam. Then we 
can inform students in subsequent courses of the core sections of the course and 
provide guidance for online students. The following table is from a Java class at 
the University of Regina in which students use the Web for their course material. 
There are 115 students, 6 quizzes, and one final examination: 





        Quiz 1  Quiz 2  Quiz 3  Quiz 4  Quiz 5  Quiz 6  FINAL
S1          98     100      90      89      91      85     90
S2         100      90      30      45      55      32     40
S3          68      70      80      89      91      85     85
S4          88     100      80      69      75      85     81
S5          76      65      50      70      46      49     46
...
S115        58      70      90      60      78      97     87



We chose the Java programming language for the implementation, for the
following reasons. First, Java is becoming more and more popular. Second, it
has various utilities that can simplify the programming task. Third, Java aims
to become a universal programming language running on a wide variety of
machines. Finally, it can easily be used in web applications. To implement this
work, we divide the task into five classes: the RoughSet class, ReadData class,
Reduct class, Inductive class, and Strength class.

RoughSet Class This class imports the ReadData class, Reduct class, Inductive
class, and Strength class. It contains the main method that makes this class the
executable class. It defines several two-dimensional arrays to hold the converted
information table (students’ records) and calls the methods that are defined in
the imported classes.

ReadData Class This class reads the student information needed from any text
file, converts percentage marks into pass or fail marks, combines entries with the
same values, and generates a small table. The StringTokenizer class predefined
in the Java language is used to process the text file. Using Table 1 as our example,
the ReadData class reads Table 1, combines the students, and derives the
collapsed student information table that contains the eight sample classes shown
in the table below. The total number of students in Table 2 is 104. The other
11 students received failure scores in all areas and do not need to be considered
here.
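
The conversion and collapsing step can be sketched as follows. The paper's implementation is in Java with StringTokenizer; this Python sketch is not that code. The pass mark of 50 is an assumption (the text does not state the threshold), and the function names are hypothetical. The four sample rows are copied from the student table above.

```python
from collections import Counter

PASS_MARK = 50  # assumption: the paper does not state the pass threshold


def parse(text):
    """Read whitespace-separated lines: student id followed by marks."""
    records = {}
    for line in text.strip().splitlines():
        student, *marks = line.split()
        records[student] = [int(m) for m in marks]
    return records


def to_pass_fail(marks, pass_mark=PASS_MARK):
    """Convert percentage marks into a tuple of 'p'/'f' values."""
    return tuple("p" if m >= pass_mark else "f" for m in marks)


def collapse(records):
    """Count how many students share each pass/fail pattern."""
    return Counter(to_pass_fail(marks) for marks in records.values())


sample = """
S1 98 100 90 89 91 85 90
S2 100 90 30 45 55 32 40
S3 68 70 80 89 91 85 85
S4 88 100 80 69 75 85 81
"""

collapsed = collapse(parse(sample))
for row, total in collapsed.items():
    print(" ".join(row), total)
```

Under the assumed threshold, S1, S3, and S4 collapse into a single all-pass class, and S2 yields the pattern p p f f p f f.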





      Quiz 1  Quiz 2  Quiz 3  Quiz 4  Quiz 5  Quiz 6  FINAL  Total
e1    p       p       p       p       p       p       p      76
e2    p       p       f       f       p       f       f      4
e3    p       f       f       p       p       f       f      3
e4    p       p       f       p       p       p       p      6
e5    p       p       f       p       p       f       f      1
e6    p       p       p       f       f       f       f      5
e7    f       p       p       p       p       p       p      8
e8    p       p       p       p       p       f       p      10



Reduct Class The ReadData class has successfully obtained the collapsed table, 
but it may contain redundant condition attributes, which means that some con- 
dition attributes may not determine the decision at all. Therefore, we need to 
find the reduct and remove the redundant attributes. 

To implement this step, linear searching and sorting algorithms are used,
and a three-dimensional array is used to hold the
equivalence relations; that is, the family $\mathcal{R} = \{Q1, Q2, Q3, Q4, Q5, Q6, Final\}$.
The positive region of $\mathcal{R}$ is held in a one-dimensional array:

$POS_{\mathcal{R}}(F) = \{e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8\}$

In order to compute the reduct of $\mathcal{R}$ with respect to Final, we then have to
find out whether the family $\mathcal{R}$ is F-dependent or not. We use a while loop to
remove Q1, Q2, Q3, Q4, Q5, and Q6 in turn and check the positive region
each time. If the positive region obtained after removing a quiz differs from the
positive region of $\mathcal{R}$, we keep this condition attribute in order to find the reduct. The
following is the partial result:

$U/IND(\mathcal{R} - \{Q1\}) = \{\{e_1, e_7\}, \{e_2\}, \{e_3\}, \{e_4\}, \{e_5\}, \{e_6\}, \{e_8\}\}$

$POS_{\mathcal{R} - \{Q1\}}(F) = \{e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8\} = POS_{\mathcal{R}}(F)$

$U/IND(\mathcal{R} - \{Q2\}) = \{\{e_1\}, \{e_2\}, \{e_3, e_5\}, \{e_4\}, \{e_6\}, \{e_7\}, \{e_8\}\}$

$POS_{\mathcal{R} - \{Q2\}}(F) = \{e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8\} = POS_{\mathcal{R}}(F)$

...

$U/IND(\mathcal{R} - \{Q6\}) = \{\{e_1, e_7, e_8\}, \{e_2, e_3, e_4, e_5\}, \{e_6\}\}$

$POS_{\mathcal{R} - \{Q6\}}(F) = \{e_1, e_6, e_7, e_8\} \neq POS_{\mathcal{R}}(F)$

Because the positive region obtained by removing Quiz 3, Quiz 5, or Quiz 6 does
not equal the positive region of Final, Quiz 3, Quiz 5, and Quiz 6 are indispensable.
Therefore, we obtain the reduct in a two-dimensional array as follows:

      Quiz 3  Quiz 5  Quiz 6  FINAL  Total
e1    p       p       p       p      76
e2    f       p       f       f      4
e3    f       p       f       f      3
e4    f       p       p       p      6
e5    f       p       f       f      1
e6    p       f       f       f      5
e7    p       p       p       p      8
e8    p       p       f       p      10
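
The attribute-removal loop can be reproduced on the collapsed table in a few lines. This is a sketch under stated assumptions, not the authors' Java code: rows are encoded as seven-character strings (Quiz 1 to Quiz 6 followed by FINAL), and the names `positive_region` and `find_reduct` are invented for illustration.

```python
from collections import defaultdict

# Collapsed table: characters 0-5 are Quiz 1..6, character 6 is FINAL.
TABLE = {
    "e1": "ppppppp",
    "e2": "ppffpff",
    "e3": "pffppff",
    "e4": "ppfpppp",
    "e5": "ppfppff",
    "e6": "pppffff",
    "e7": "fpppppp",
    "e8": "pppppfp",
}
FINAL = 6


def positive_region(quizzes):
    """Classes (by the given quiz columns) that agree on the FINAL value."""
    groups = defaultdict(set)
    for obj, row in TABLE.items():
        groups[tuple(row[q] for q in quizzes)].add(obj)
    pos = set()
    for members in groups.values():
        if len({TABLE[obj][FINAL] for obj in members}) == 1:
            pos |= members
    return pos


def find_reduct():
    """Drop each quiz in turn if the positive region stays unchanged."""
    kept = list(range(6))
    full = positive_region(kept)
    for q in range(6):
        trial = [k for k in kept if k != q]
        if positive_region(trial) == full:
            kept = trial  # quiz q is dispensable
    return ["Q%d" % (k + 1) for k in kept]


print(find_reduct())
print(sorted(positive_region([2, 4])))  # Quiz 3 and Quiz 5 only
```

With Quiz 6 removed from the reduct, the classes {e2, e3, e4, e5} become inconsistent, which is why only e1, e6, e7, and e8 remain in the positive region.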

Inductive Class The remaining work required is to find the decision rules. To 
find the decision rules, Inductive Class uses several arrays repeatedly to hold the 
domain, concept, indiscernibility classes, lower and upper approximations, and 
discriminant indices in a while loop. During the computation inside the loop, 
linear searching and sorting algorithms are used many times in order to obtain 
the correct result. The routine then shows the alpha values, i.e, the discriminant 
indices. The following is the outcome: 

Domain: $U = \{e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8\}$

Failure Concept: $Y = \{e_2, e_3, e_5, e_6\}$

The indiscernibility classes based on Quiz 3 are:

$Y_1 = \{e_1, e_6, e_7, e_8\}$    $Des(Y_1) = \{Quiz3 = P\}$

$Y_2 = \{e_2, e_3, e_4, e_5\}$    $Des(Y_2) = \{Quiz3 = F\}$

upper approximation: $\overline{Y} = \{e_1, e_6, e_7, e_8, e_2, e_3, e_4, e_5\}$

lower approximation: $\underline{Y} = \{\}$

discriminant index: $\alpha_{Q3} = 1 - |\overline{Y} - \underline{Y}| / |U| = 0.00$

The rest are computed similarly and we obtain the results shown in the 
following table: 



Attribute    α value
Quiz 3       0.00
Quiz 5       0.125
Quiz 6       0.375



Because Quiz 6 has the highest discriminant index, it best determines
membership in the concept $Y$, and we obtain the first decision rule.

Rule 1: {Quiz 6 = f} => {Final = f}

The decision tree below is used here to find the new domain, which is the
same as that given by the formula mentioned in the Distance Learning Algorithm:

$E = E - [(E - \overline{Y}) \cup \underline{Y}]$

Quiz 6 = p: {e1:p, e4:p, e7:p}    Quiz 6 = f: {e2:f, e3:f, e5:f, e6:f, e8:p}

Based on the decision tree, we obtain the horizontally reduced table as follows:

      Quiz 3  Quiz 5  Quiz 6  FINAL  Total
e2    f       p       f       f      4
e3    f       p       f       f      3
e5    f       p       f       f      1
e6    p       f       f       f      5
e8    p       p       f       p      10



We then merge Quiz 6 with the rest of the condition attributes to find the 
highest discriminant indices. That produces two combinations: Quiz 3 and Quiz 
6, Quiz 5 and Quiz 6. The following table shows the result of the second round 
of the program: 



Attribute          α value
Quizzes 3 and 6    0.60
Quizzes 5 and 6    0.20



Because Quizzes 3 and 6 have the highest discriminant index, we thus obtain 
the second decision rule: 

Rule 2: {Quiz 3 = f, Quiz 6 = f} => {Final = f}
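
Both rounds of discriminant indices, and the shrinking of the domain between them, can be reproduced with a short sketch. The data come from the reduct columns (Quiz 3, Quiz 5, Quiz 6, FINAL) of the collapsed table; the helper names and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict

# Reduct table: (Quiz 3, Quiz 5, Quiz 6, FINAL) for each collapsed class.
ROWS = {
    "e1": ("p", "p", "p", "p"),
    "e2": ("f", "p", "f", "f"),
    "e3": ("f", "p", "f", "f"),
    "e4": ("f", "p", "p", "p"),
    "e5": ("f", "p", "f", "f"),
    "e6": ("p", "f", "f", "f"),
    "e7": ("p", "p", "p", "p"),
    "e8": ("p", "p", "f", "p"),
}
Q3, Q5, Q6, FINAL = 0, 1, 2, 3
FAIL = {e for e, row in ROWS.items() if row[FINAL] == "f"}  # failure concept


def approximations(domain, cols, concept):
    """Lower and upper approximation of concept within domain."""
    groups = defaultdict(set)
    for e in domain:
        groups[tuple(ROWS[e][c] for c in cols)].add(e)
    lower, upper = set(), set()
    for members in groups.values():
        if members <= concept:
            lower |= members
        if members & concept:
            upper |= members
    return lower, upper


def alpha(domain, cols, concept):
    """Discriminant index 1 - |upper - lower| / |domain|."""
    lower, upper = approximations(domain, cols, concept)
    return 1 - len(upper - lower) / len(domain)


domain = set(ROWS)
first_round = {name: alpha(domain, [c], FAIL)
               for name, c in (("Q3", Q3), ("Q5", Q5), ("Q6", Q6))}

# New domain: E = E - [(E - upper) | lower], using the best attribute (Q6).
lower, upper = approximations(domain, [Q6], FAIL)
domain = domain - ((domain - upper) | lower)

second_round = {
    "Q3,Q6": alpha(domain, [Q3, Q6], FAIL & domain),
    "Q5,Q6": alpha(domain, [Q5, Q6], FAIL & domain),
}
print(first_round, sorted(domain), second_round)
```

The new domain is exactly the boundary region left by Quiz 6, which is why the second round works on only five classes.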

The resultant decision tree is shown in the following figure:

[Figure: decision tree rooted at Q6]




By applying the same method, we finally obtain the last rule: 

Rule 3: {Quiz 5 = f, Quiz 6 = f} => {Final = f}

Strength Class We have the rules available now, but how strongly should online 
students believe in these rules? The Strength class answers that question. It finds 
the strength of each rule by using the following formula: 

    strength = (# of positive cases covered by the rule)
               / (# of cases covered by the rule, including both positive and negative)

To implement the last step, we use the linear search algorithm again. The
strength of the rules is finally held in a one-dimensional array and at the end
of the computation, the program posts this information about the rules, as seen
in the following table:



Rules                                           Strength
R1: {Quiz 6 = f} => {Final = f}                 56.52%
R2: {Quiz 3 = f, Quiz 6 = f} => {Final = f}     100.0%
R3: {Quiz 5 = f, Quiz 6 = f} => {Final = f}     100.0%
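
The strengths in the table can be recomputed from the student counts of the collapsed classes. The sketch below weights each class by its Total column; the dictionary layout and the `strength` helper are assumptions made for illustration.

```python
# Reduct rows with student counts: Quiz 3, Quiz 5, Quiz 6, FINAL, Total.
ROWS = [
    {"Q3": "p", "Q5": "p", "Q6": "p", "FINAL": "p", "total": 76},  # e1
    {"Q3": "f", "Q5": "p", "Q6": "f", "FINAL": "f", "total": 4},   # e2
    {"Q3": "f", "Q5": "p", "Q6": "f", "FINAL": "f", "total": 3},   # e3
    {"Q3": "f", "Q5": "p", "Q6": "p", "FINAL": "p", "total": 6},   # e4
    {"Q3": "f", "Q5": "p", "Q6": "f", "FINAL": "f", "total": 1},   # e5
    {"Q3": "p", "Q5": "f", "Q6": "f", "FINAL": "f", "total": 5},   # e6
    {"Q3": "p", "Q5": "p", "Q6": "p", "FINAL": "p", "total": 8},   # e7
    {"Q3": "p", "Q5": "p", "Q6": "f", "FINAL": "p", "total": 10},  # e8
]


def strength(conditions, decision="f"):
    """Share of students matching the conditions who also match the decision."""
    covered = [r for r in ROWS
               if all(r[attr] == val for attr, val in conditions.items())]
    n_covered = sum(r["total"] for r in covered)
    n_positive = sum(r["total"] for r in covered if r["FINAL"] == decision)
    return 100.0 * n_positive / n_covered


r1 = strength({"Q6": "f"})
r2 = strength({"Q3": "f", "Q6": "f"})
r3 = strength({"Q5": "f", "Q6": "f"})
print(round(r1, 2), round(r2, 2), round(r3, 2))
```

For Rule 1, 23 students failed Quiz 6 and 13 of them failed the final, which gives 13/23, or about 56.52%.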



The information shown in the above table tells readers that the rules measure
previous online students’ performance. It guides repeating and new online
students in focusing their studies and provides information to the course instructor
about whether the online materials need to be modified or reordered. The first
rule has a probability of 56.52%. That means that if a student fails Quiz 6, he or she
has a 56.52% possibility of failing the final. The second and third rules say that if
a student fails to understand the material related to Quizzes 3 and 6, or Quizzes
5 and 6, then the student has a 100% possibility of failing the course. Therefore,
the materials related to Quizzes 3, 5, and 6 are the core materials for this online
class. As the Quiz 6 related materials are so important, the instructor
might wish to provide additional examples to make the material easier to
understand.

6 Conclusion 

This paper demonstrates that by applying Rough Sets to distance education
over the World Wide Web, the problem of limited feedback can be reduced.
This makes distance education more useful. Rough Sets based distance learning
allows decision rules to be induced that are important to both students and
instructors. It thus guides students in their learning. For repeating students, it
specifies the areas they should focus on according to the rules applied to them.
For new students, it tells them which sections need extra effort in order to pass
the course. Rough Sets based distance education can also guide the instructor
about the best order in which to present the material. Based on the DLA results,
the instructor may reorganize or rewrite the course notes by providing more
examples and explaining more concepts. Therefore, Rough Sets based distance
education improves the state of the art of Web learning by providing virtual
student/teacher feedback and making distance education more effective.

1 Introduction

This paper applies rough set theory to find decision rules using ORACLE RDBMS.
The major steps include elimination of redundant attributes and of redundant
attribute values for each tuple. In this paper three algorithms to extract high
frequency decision rules from very large decision tables are presented. One
algorithm uses pure SQL syntax. The other two use a sorting algorithm and a binary
tree data structure respectively. The performance of these methods is
evaluated and compared. The use of a binary tree structure improves the
computational time tremendously. The pure SQL results indicate that some major change to
the query optimizer may be desirable if it is to be used for data mining.

2 Rough Sets Methodology 

A decision table is a variation of a relation [3], in which we distinguish two kinds 
of variables: condition attributes (denoted as COND) and decision attributes 
(denoted as DECS); each row represents a decision rule; see Table 1. 





         Condition attributes               Decision attributes
Rule #   TEST   LOW   HIGH   NEW   CASE     RESULT
R1       1      0     0      11    2        1
R2       4      0     1      11    6        3
R3       0      1     1      10    2        1
R4       0      1     1      12    20       10
R5       1      1     0      12    20       10
R6       1      1     0      23    60       30
R7       1      0     0      23    60       88



Table 1. Decision table 



The most important concept in rough set theory is the reduct. A relative reduct
is a minimal subset of condition attributes that can at least classify the table
into the same equivalence classes as the decision attributes [1]. For example, the set
{LOW, CASE} is one of the relative reducts of Table 1.

Applying rough set methodology for data mining is about the reduction or 
simplification of a decision table: 

1. Step 1: elimination of redundant columns. This is to search for relative 
reducts 

2. Step 2: elimination of redundant values for each rule. This is to search for 
value reducts. 

Step 1: search relative reducts

Finding relative reducts means finding the minimal subsets S of COND on which
the decision attributes are (fully) functionally dependent, i.e., S ==> DECS.
However, most databases contain noise, so partial dependency with degree $k$ is
a more practical approach. Therefore, the basic algorithm to search for relative
reducts is to compute the number of consistent rules with respect to S, $c_S$, then
compute the degree of dependency, $K = c_S$ / (total rows of database). If $K \geq k$,
S is a relative reduct. Otherwise it is not.

Since for each entity we have to scan the whole database once to decide
whether it is a consistent rule or not, the running time for the above algorithm
is always $O(n^2)$, where $n$ is the number of entities of the database. Here, we
present a faster method. For a partial dependency with degree $k$, the maximal
number of inconsistent rules allowed, called $E$, is $E = n(1 - k)$. The number of inconsistent
rules actually appearing inside the database is denoted as exCount. Obviously,
when exCount > E the subset S to be checked is not a relative reduct; we can
stop checking the remaining entities.
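
The early-stopping check can be sketched in Python on the rows of Table 1. This is not the paper's SQL or binary-tree implementation; the function name `depends_with_degree` and the in-memory list of dictionaries are assumptions made for illustration, and the check covers only the dependency test, not minimality of S.

```python
ROWS = [
    {"TEST": 1, "LOW": 0, "HIGH": 0, "NEW": 11, "CASE": 2, "RESULT": 1},
    {"TEST": 4, "LOW": 0, "HIGH": 1, "NEW": 11, "CASE": 6, "RESULT": 3},
    {"TEST": 0, "LOW": 1, "HIGH": 1, "NEW": 10, "CASE": 2, "RESULT": 1},
    {"TEST": 0, "LOW": 1, "HIGH": 1, "NEW": 12, "CASE": 20, "RESULT": 10},
    {"TEST": 1, "LOW": 1, "HIGH": 0, "NEW": 12, "CASE": 20, "RESULT": 10},
    {"TEST": 1, "LOW": 1, "HIGH": 0, "NEW": 23, "CASE": 60, "RESULT": 30},
    {"TEST": 1, "LOW": 0, "HIGH": 0, "NEW": 23, "CASE": 60, "RESULT": 88},
]


def depends_with_degree(rows, subset, decision, k):
    """True if the decision depends on subset with degree at least k.

    Stops as soon as the number of inconsistent rows (exCount) exceeds
    the allowed maximum E = n * (1 - k).
    """
    max_inconsistent = len(rows) * (1 - k)
    ex_count = 0
    for row in rows:
        key = tuple(row[a] for a in subset)
        # A row is inconsistent if another row shares its condition
        # values on subset but has a different decision value.
        if any(tuple(o[a] for a in subset) == key
               and o[decision] != row[decision] for o in rows):
            ex_count += 1
            if ex_count > max_inconsistent:
                return False  # no need to check the remaining rows
    return True


print(depends_with_degree(ROWS, ["LOW", "CASE"], "RESULT", 1.0))
print(depends_with_degree(ROWS, ["NEW"], "RESULT", 1.0))
print(depends_with_degree(ROWS, ["NEW"], "RESULT", 0.4))
```

The worst case is still quadratic, but with k = 1.0 the scan for {NEW} stops at the first inconsistent row.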

Step 2: search value reducts

To illustrate the idea, we will use Table 1 and one of its relative reducts,
{TEST, LOW, NEW}. For rule R1, the minimal condition is {TEST, NEW}.
That is its value reduct. We present all value reducts of the relative reduct {TEST,
LOW, NEW} in Table 2. In other words, Table 2 represents the minimal
conditions for each rule.





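        Condition attributes                Decision attributes 
Rule    TEST   LOW   HIGH   NEW   CASE      RESULT 
R1      1      -     -      11    -         1 
R2      4      -     -      -     -         3 
R3      -      -     -      10    -         1 
R4      -      -     -      12    -         10 
R5      -      -     -      12    -         10 
R6      -      1     -      23    -         30 
R7      -      0     -      23    -         88 

(A "-" marks an attribute that is not part of the value reduct.) 

Table 2. Value reducts of Table 1 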
We summarize the features of value reducts of a rule as follows. 

1. For a value subset, if it is unique to this rule, it is a value reduct of this rule. 
For example, {TEST=1, NEW=11} is unique to rule R1. This subset alone 
determines rule R1; no other attributes are needed. 

2. For a value subset, if it is not unique to this rule, but it always implies the 
same decision attribute values, it is also a value reduct. For example, in both 
rules R4 and R5, we have {NEW=12} ==> {RESULT=10}. Therefore, it is 
the value reduct for both R4 and R5. 

The value reducts (or decision rules) discussed so far have the following features: 

1. all rules are included, even those that appear only once in the database; 

2. all rules are strictly consistent. That is, if {NEW=23} ==> {RESULT=30}, 
it cannot imply other values of RESULT. 

This kind of decision rule only works for very clean databases. Actual databases 
usually contain noise or errors. In the subsequent parts of this paper, we will 
consider a more useful definition of decision rules, which has the following 
features. 

1. Rules that appear just a few times are not considered. Only high-frequency 
rules are considered, that is, the number of times a rule appears must 
be at least a threshold, minsup [2]. 

2. The appearance of errors is considered. For example, suppose {NEW=23} ==> 
{RESULT=30} appears 99 times, and other cases also exist, such as {NEW=23} 
==> {RESULT=77}, which appears only once. We say the former has 
confidence 99% and the latter 1% [4]. When a database contains errors, we 
consider a rule a value reduct (decision rule) as long as its confidence is at 
least a specified minimal confidence, minconf. 

The rest of this paper presents algorithms to search for decision rules that 
satisfy the above conditions. 

3 Algorithm 1: generic SQL implementation 

We start from a database-oriented solution, i.e. using SQL queries for database 
mining. In the following, we assume the database to be mined is table T, 
the set of condition attributes of table T, COND, includes c1, c2, ..., cm, and 
the set of decision attributes, DECS, includes d1, d2, ..., dn. 

Step 1: search for relative reducts 

For a subset S of COND to be checked, assuming it is { s1, s2, ..., sl }, we 
first group the table T by S and obtain a temporary table tmp1. Table tmp1 will 
contain all different groups by S. The SQL is as follows. 

CREATE TABLE tmp1 AS 
SELECT s1, s2, ..., sl, count(*) AS c1 
FROM T 
GROUP BY s1, s2, ..., sl; 

Searching Decision Rules in Very Large Databases Using Rough Set Theory 349 

Secondly, we group the table T by both S and the decision attributes, and obtain 
a temporary table tmp2. Table tmp2 will contain all different groups for S+DECS. 
The SQL is as follows. 

CREATE TABLE tmp2 AS 
SELECT s1, s2, ..., sl, count(*) AS c2 
FROM T 
GROUP BY s1, s2, ..., sl, d1, d2, ..., dn; 

Next, we pick up the inconsistent rules, which have different count(*) 
values in tables tmp1 and tmp2; the sum of the count(*) values of these 
inconsistent rules is the total number of inconsistent rules. The counts are 
summed from tmp2, because an inconsistent group of tmp1 matches several rows 
of tmp2 in the join and would otherwise be counted more than once. The 
SQL is as follows. 

SELECT SUM(tmp2.c2) AS exCount 
FROM tmp1, tmp2 
WHERE tmp1.s1 = tmp2.s1 
AND ... 
AND tmp1.sl = tmp2.sl 
AND tmp1.c1 <> tmp2.c2; 

If exCount <= E, the maximum number of inconsistent rules introduced above, 
S is a relative reduct. 

Here, we will use Table 1 to explain the above operations. After we group 
Table 1 by CASE, the tables tmp1 and tmp2 we get are as follows. 

Table tmp1 
CASE   count(*) AS c1 
2      2 
6      1 
20     2 
60     2 

Table tmp2 
CASE   count(*) AS c2 
2      2 
6      1 
20     2 
60     1 
60     1 

We will pick up the following row from tmp1. 

CASE   count(*) 
60     2 

This shows that the inconsistent rules are those with {CASE = 60}, and that 
there are two inconsistent rules in total. Thus exCount is 2. If E is equal to 2, 
{CASE} is a relative reduct. 
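The two groupings and the resulting exCount can be reproduced with a short Python sketch. The rows below are hypothetical: only the CASE values and group sizes follow tmp1 and tmp2, while the RESULT labels are placeholders rather than the real contents of Table 1.

```python
from collections import Counter

# Hypothetical (CASE, RESULT) pairs chosen to match the counts in tmp1 and tmp2.
rows = [(2, "r1"), (2, "r1"), (6, "r2"), (20, "r3"), (20, "r3"), (60, "r4"), (60, "r5")]

tmp1 = Counter(case for case, _ in rows)   # GROUP BY CASE
tmp2 = Counter(rows)                       # GROUP BY CASE, RESULT

# A CASE group is inconsistent when one of its (CASE, RESULT) groups is smaller
# than the whole CASE group, i.e. the two counts differ.
inconsistent = {case for (case, _), c2 in tmp2.items() if tmp1[case] != c2}

# Each inconsistent group contributes all of its rows exactly once.
ex_count = sum(tmp1[case] for case in inconsistent)
print(inconsistent, ex_count)
```

Collecting the inconsistent CASE values in a set before summing avoids counting a group once per decision value, which a plain join of the two tables would do.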

Step 2: search for value reducts 

To search for value reducts from a relative reduct, all of its subsets are 
checked. The operations are explained using an example. For Table 1, after 
the relative reduct {TEST, LOW, NEW} is obtained, the value reducts related 
to {NEW} are obtained as follows. It is assumed that minsup = 3 and 
minconf = 70%. 

First, we group table T by {NEW, RESULT} and record the counts. The SQL 
is as follows. 

CREATE TABLE tmpA AS 
SELECT NEW, RESULT, count(*) AS countA 
FROM T 
GROUP BY NEW, RESULT; 

Assume the table tmpA we get is as follows (a little different from the original 
one). 



Table tmpA 
NEW   RESULT   countA 
11    1        2 
11    3        8 
10    1        1 
12    10       3 
23    30       4 



Next, we group the table T only by {NEW}, and also record the count. The SQL 
is as follows. 

CREATE TABLE tmpB AS 
SELECT NEW, count(*) AS countB 
FROM T 
GROUP BY NEW; 

We get the table tmpB. 

Table tmpB 
NEW   countB 
11    10 
10    1 
12    3 
23    4 

Next, we pick up the value reducts, which are some of the tuples of table tmpA. 
These tuples have identical values in the subset {NEW} in tmpA and tmpB, 
countA >= minsup, and confidence >= minconf, where the confidence is 
countA / countB. For example, in table tmpA, we will pick up the following tuples. 

NEW   RESULT   support   confidence 
11    3        8         80% 
12    10       3         100% 
23    30       4         100% 

The SQL we use in this step is as follows. 

SELECT tmpA.NEW, tmpA.RESULT, tmpA.countA, tmpA.countA / tmpB.countB 
FROM tmpA, tmpB 
WHERE tmpA.NEW = tmpB.NEW 
AND tmpA.countA >= minsup 
AND tmpA.countA / tmpB.countB >= minconf; 
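The support and confidence filter can be checked against the example tables with a few lines of Python. This is a sketch only: the function name and the in-memory list standing in for tmpA are assumptions, and countB is recomputed from tmpA rather than read from a separate table.

```python
# (NEW, RESULT, countA) tuples copied from the example table tmpA.
tmp_a = [(11, 1, 2), (11, 3, 8), (10, 1, 1), (12, 10, 3), (23, 30, 4)]

def value_reducts(tmp_a, minsup, minconf):
    """Keep the tuples whose support and confidence reach the thresholds."""
    count_b = {}                                  # NEW -> countB, as in table tmpB
    for new, _, count_a in tmp_a:
        count_b[new] = count_b.get(new, 0) + count_a
    return [(new, result, count_a, count_a / count_b[new])
            for new, result, count_a in tmp_a
            if count_a >= minsup and count_a / count_b[new] >= minconf]

for rule in value_reducts(tmp_a, minsup=3, minconf=0.7):
    print(rule)
```

With minsup = 3 and minconf = 0.7 this keeps the three tuples listed in the result table, with confidences 0.8, 1.0 and 1.0.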

Step 3: removing supersets from the value reduct table 

The last step in mining decision rules is to remove supersets from the 
value reduct table. The SQL we use is as follows. Here, we use a special function 
provided by the ORACLE RDBMS, NVL(x, y), which returns x if x is not null, 
and y otherwise. 

DELETE FROM valReductTb t1 
WHERE ROWID > ( 
    SELECT min(rowid) 
    FROM valReductTb t2 
    WHERE t1.decs = t2.decs 
    AND ((t2.conds is null and t1.conds is null) 
        OR NVL(t1.conds, t2.conds) = NVL(t2.conds, t1.conds)) 
); 

Analysis 

Mining decision rules requires looping over all subsets of condition attributes. 
Therefore, the SQL queries for data mining are very long and complicated, because 
SQL does not provide a for or do/while repetition structure like C. This also 
forces us to use dynamic SQL. In addition, SQL data mining performs poorly. 
When checking whether {NEW} is a relative reduct, checking the remaining rows 
should stop as soon as exCount > E. However, with SQL the check is always 
executed on all rows. Assuming N is the number of rows, 
the running time for searching relative reducts is calculated as follows. 

1. For the two "group by" operations, the time complexity is O(2N lg N). 

2. For the "table join" operation, the average running time is O(N lg N). 

Thus, the total running time is O(3N lg N) for one subset. For a decision 
table, most COND subsets are usually not relative reducts, so the SQL method 
wastes a lot of time on non-reduct COND subsets. 

Searching for value reducts uses similar methods: two "group by" operations, 
then a "table join". Thus the running time for searching value reducts for each 
COND subset is also O(3N lg N). 

4 Algorithm 2: a faster algorithm using sorting 

Before checking whether a subset S is a relative reduct, we sort the 
table by S using an O(n log n) sorting algorithm. Then, starting from the first 
row, we can easily find inconsistent rules by comparing neighboring rows. 
A counter exCount records the number of inconsistent rules. For the kth 
and (k+1)th rows, if they have the same values in attributes { s1, s2, ..., sp } but 
differ in the decision attributes, they are inconsistent rules, and exCount is 
increased by one. If exCount > E, { s1, s2, ..., sp } is not a relative reduct, and 
checking is stopped immediately. 

Assuming N is the number of rows of the table to be mined, this algorithm 
gives a running time of O(N lg N + E) in the best case and O(N lg N + N) in the 
worst case when checking relative reducts. Since most COND subsets are not 
relative reducts, the worst case does not happen often, and the actual running 
time of this algorithm is much better than the worst case. 

This sorting method can also be used for searching value reducts. However, as 
shown before, searching value reducts requires checking every row, in the best 
case as well as the worst case. Thus the total running time is O(N lg N + N). 
Compared with the first algorithm, this is not a large improvement. 
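A minimal Python sketch of the sort-and-compare check follows. The function name and the row layout are assumptions. The sketch follows the text literally and increases the counter once per neighboring pair that disagrees, so its count can differ from the row-based count used in the SQL version.

```python
def is_reduct_by_sorting(rows, cond_idx, dec_idx, max_exceptions):
    """Sort by the values of S, then compare neighboring rows (hypothetical helper)."""
    def key(row):
        return tuple(row[i] for i in cond_idx)

    def dec(row):
        return tuple(row[i] for i in dec_idx)

    # Secondary sort on the decision keeps equal decisions next to each other.
    ordered = sorted(rows, key=lambda row: (key(row), dec(row)))
    ex_count = 0
    for prev, cur in zip(ordered, ordered[1:]):
        if key(prev) == key(cur) and dec(prev) != dec(cur):
            ex_count += 1
            if ex_count > max_exceptions:
                return False       # stop scanning as soon as E is exceeded
    return True
```

The sort dominates the cost; the scan itself stops early whenever the subset turns out not to be a relative reduct.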

5 Algorithm 3: a much faster algorithm using tree structure 

For Algorithm 2, the cost of sorting is always incurred. Sometimes, this 
sorting is not necessary. For example, if the first five rows are inconsistent rules, 
then exCount = 5. If the maximum number of exceptions allowed, E, is 4, we 
immediately know that {NEW} is not a relative reduct, and stop. In this case, 
sorting is not necessary at all. 

In this section, we study how to avoid the sorting process and get better 
performance. We will use a binary-search-tree data structure. Each node of the 
tree represents the information of an entity. Assume S is the subset of condition 
attributes to be checked. A node of this binary tree has the form: 

struct node { 
    values of S, 
    total number of rules, 
    list of values of decision attributes 
}; 

Before checking relative reducts, we initialize exCount = 0. When an entity 
is checked, a node is built, the tree is searched, and the node is inserted into 
the tree according to the values of S. During this process, the following 
cases can occur: 

1. if no node in the tree has values of S identical to those of the new node, the 
new node is inserted into the tree; 

2. if a node in the tree has values of S identical to those of the new node, and 
its list of decision attribute values contains a value different from the 
new node's values, the total number of rules for this node is counted 
as part of exCount. If exCount is larger than E, S is not a relative reduct, 
and we stop immediately. 

This algorithm does not require sorting the table. In the best case, the 
running time is O(E). In the worst case, the running time is O(N lg N) when 
the tree is balanced, but O(N^2) when the tree is linear. In a real-world 
decision table, there are not many relative reducts, so the worst case happens 
rarely. Compared with Algorithms 1 and 2, this algorithm greatly improves the 
performance of searching relative reducts. 

This algorithm can also be used for searching value reducts. However, we 
always need to build a tree from all entities, so the running time is O(N lg N) or 
O(N^2). The tree-structure algorithm does not improve much on the second step. 
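The tree-based check can be sketched in Python as below. This is one possible reading of the description, not the authors' code: the class and function names are hypothetical, the tree is an unbalanced binary search tree, and a group's whole row count is added to exCount at the moment the group first becomes inconsistent, with one more added for each later row of that group.

```python
class Node:
    def __init__(self, key, decision):
        self.key = key                 # values of S
        self.count = 1                 # total number of rules with these values
        self.decisions = [decision]    # distinct decision values seen so far
        self.left = None
        self.right = None

def is_reduct_by_tree(rows, cond_idx, dec_idx, max_exceptions):
    root = None
    ex_count = 0
    for row in rows:
        key = tuple(row[i] for i in cond_idx)
        decision = tuple(row[i] for i in dec_idx)
        if root is None:
            root = Node(key, decision)
            continue
        node = root
        while True:
            if key == node.key:
                node.count += 1
                if decision not in node.decisions:
                    node.decisions.append(decision)
                    # First conflict: every row of the group becomes inconsistent.
                    ex_count += node.count if len(node.decisions) == 2 else 1
                elif len(node.decisions) > 1:
                    ex_count += 1      # one more row of an inconsistent group
                break
            side = "left" if key < node.key else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, Node(key, decision))
                break
            node = child
        if ex_count > max_exceptions:
            return False               # stop without building the rest of the tree
    return True
```

Because rows are inserted one at a time, no sorting pass is needed and the loop can exit after only a few rows when the subset is clearly not a reduct.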

6 Conclusion 

1. Rough set methodology is a good tool for mining decision rules from a 
traditional RDBMS; it can be adapted to different noise levels. 

2. Traditional SQL can be used for database mining, but it is costly; some 
adjustments to query optimizers seem necessary. 

3. An algorithm using sorting gives slightly better performance. 

4. A much faster algorithm using a binary-search-tree data structure is 
developed, which avoids the complete sorting process. The loop stops as soon as 
it finds sufficiently many inconsistent tuples. 
Abstract. The main task in decision tree construction algorithms is to 
find the "best partition" of the set of objects. In this paper, we investigate 
the problem of optimal binary partition of a continuous attribute for large 
data sets stored in relational databases. The critical factor for the time 
complexity of algorithms solving this problem is the number of simple SQL 
queries necessary to construct such partitions. The straightforward approach to 
optimal partition selection needs at least O(N) queries, where N is the 
number of pre-assumed partitions of the searching space. We show some 
properties of optimization measures related to discernibility between 
objects that allow us to construct a partition very close to optimal using 
only O(log N) simple queries. 

Key words: Data Mining, Rough set, Decision tree. 



1 Introduction 

The philosophy of "rough set" data analysis methods is based on handling of 
discernibility between objects (see [9, 13]). In recent years, one can find a number 
of applications of rough set theory in Machine Learning, Data Mining or KDD. 
One of the main tasks in data mining is the classification problem. The two most 
popular approaches to classification problems are "decision tree" and "decision 
rule set". Most "rough set" methods deal with the classification problem by 
extracting a set of decision rules (see [14, 13, 10]). We have shown in previous 
papers that the well known discernibility property in rough set theory can be 
used to build decision trees with high accuracy from data. 

The main step in methods of decision tree construction is to find optimal 
partitions of the set of objects. The problem of searching for optimal partitions 
of real value attributes, defined by so-called cuts, has been studied by many 
authors (see e.g. [1-3, 11]), where optimization criteria are defined by e.g. the 
height of the obtained decision tree or the number of cuts. In general, all those 
problems are hard from the computational point of view. Hence numerous 
heuristics have been investigated to develop approximate solutions of these 
problems. One of the major tasks of these heuristics is to define some approximate 
measures estimating the quality of extracted cuts. In rough set and Boolean 
reasoning based methods, the quality is defined by the number of pairs of objects 
discerned by the partition (called the discernibility measure). 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 354-361, 2001. 
We consider the problem of searching for the optimal partition of real value 
attributes, assuming that the large data table is represented in a relational 
database. The straightforward approach to optimal partition selection needs at 
least $O(N)$ simple queries, where $N$ is the number of pre-assumed partitions of 
the searching space. In this case, even linear complexity is not acceptable because 
of the time needed for one step (see [5, 8]). We assume that the answer time for 
such queries does not depend on the interval length. We show some properties of 
the considered optimization measures that allow us to reduce the size of the 
searching space. Moreover, we prove that using only $O(\log N)$ simple queries, 
one can construct a partition very close to optimal. We show that the main part of 
the formula estimating the quality of the best cut for independent variables from 
[8] is the same in the case of "fully" dependent variables. Compared with [8], we 
also extend the algorithm of searching for the best cut by adding a global 
searching strategy. 



2 Basic notions 

An information system [9] is a pair $\mathbb{A} = (U, A)$, where $U$ is a non-empty, finite 
set called the universe and $A$ is a non-empty finite set of attributes (or features), 
i.e. $a : U \to V_a$ for $a \in A$, where $V_a$ is called the value set of $a$. Elements of $U$ 
are called objects or records. Two objects $x, y \in U$ are said to be discernible by 
attributes from $A$ if there exists an attribute $a \in A$ such that $a(x) \neq a(y)$. Any 
information system of the form $\mathbb{A} = (U, A \cup \{dec\})$ is called a decision table, where 
$dec \notin A$ is called the decision attribute. Without loss of generality we assume that 
$V_{dec} = \{1, \ldots, d\}$. Then the set $DEC_k = \{x \in U : dec(x) = k\}$ will be called the 
$k$-th decision class of $\mathbb{A}$ for $1 \le k \le d$. Any pair $(a, c)$, where $a$ is an attribute 
and $c$ is a real value, is called a cut. We say that the cut $(a, c)$ discerns a pair 
of objects $x, y$ if either $a(x) < c < a(y)$ or $a(y) < c < a(x)$. 

The decision tree for a given decision table is (in the simplest case) a binary 
directed tree with test functions (i.e. boolean functions defined on the 
information vectors of objects) labelled in internal nodes and decision values 
labelled in leaves. In this paper, we consider decision trees using cuts as test 
functions. Every cut $(a, c)$ is associated with a test function $f_{(a,c)}$ such that for 
any object $u \in U$ the value of $f_{(a,c)}(u)$ is equal to 1 (true) if and only if $a(u) > c$. 
The typical algorithm for decision tree induction can be described as follows: 

1. For a given set of objects $U$, select a cut $(a, c_{Best})$ of high quality among all 
possible cuts and all attributes; 

2. Induce a partition $U_1, U_2$ of $U$ by $(a, c_{Best})$; 

3. Recursively apply Step 1 to both sets $U_1, U_2$ of objects until some stopping 
condition is satisfied. 

Developing some decision tree induction methods [3, 11] and some supervised 
discretization methods [2, 6], we should often solve the following problem: "for 
a given real value attribute $a$ and set of candidate cuts $\{c_1, \ldots, c_N\}$, find a cut 
$(a, c_i)$ belonging to the set of optimal cuts with high probability". 
Definition 1 The $d$-tuple of integers $(x_1, \ldots, x_d)$ is called the class distribution of 
the set of objects $X \subseteq U$ iff $x_k = card(X \cap DEC_k)$ for $k \in \{1, \ldots, d\}$. If the set 
of objects $X$ is defined by $X = \{u \in U : p \le a(u) < q\}$ for some $p, q \in \mathbb{R}$ then 
the class distribution of $X$ can be called the class distribution in $[p; q)$. 

Any cut $c \in C_a$ splits the domain $V_a = (l_a, r_a)$ of the attribute $a$ into two 
intervals: $I_L = (l_a, c)$; $I_R = (c, r_a)$. We will use the following notation: 

- $U_{L_j}$, $U_{R_j}$: the sets of objects from the $j$-th class in $I_L$ and $I_R$. Let 
$U_L = \bigcup_j U_{L_j}$ and $U_R = \bigcup_j U_{R_j}$, where $j \in \{1, \ldots, d\}$; 

- $(L_1, \ldots, L_d)$ and $(R_1, \ldots, R_d)$: class distributions in $U_L$ and $U_R$. Let 
$L = \sum_{j=1}^{d} L_j$ and $R = \sum_{j=1}^{d} R_j$; 

- $C_j = L_j + R_j$: number of objects in the $j$-th class; 

- $n = \sum_{j=1}^{d} C_j = L + R$: the total number of objects; 

Usually for a fixed attribute $a$ and the set of all relevant cuts $C_a = \{c_1, \ldots, c_N\}$ 
on $a$, we use some measure (or quality function) $F : \{c_1, \ldots, c_N\} \to \mathbb{R}$ to 
estimate the quality of cuts. For a given measure $F$, the straightforward 
algorithm should compute the values of $F$ for all cuts: $F(c_1), \ldots, F(c_N)$. The cut 
$c_{Best}$ which maximizes or minimizes the value of function $F$ is selected as the 
result of the searching process. The most popular measures for decision tree 
induction are "Entropy Function" and "Gini's index" [4, 1, 11]. In this paper we 
consider the discernibility measure as a quality function. Intuitively, the energy of 
the set of objects $X \subseteq U$ can be defined by the number of pairs of objects 
from $X$ to be discerned, called $conflict(X)$. Let $(N_1, \ldots, N_d)$ be the class 
distribution of $X$; then $conflict(X)$ can be computed by $conflict(X) = \sum_{i<j} N_i N_j$. 

The cut $c$ which divides the set of objects $U$ into $U_1$ and $U_2$ is evaluated by 
$W(c) = conflict(U) - conflict(U_1) - conflict(U_2)$, i.e. the larger the number of 
pairs of objects discerned by the cut $(a, c)$, the larger the chance that $c$ can be 
chosen for the optimal set of cuts. Hence, in the decision tree induction algorithms 
based on the Rough Set and Boolean reasoning approach, the quality of a given cut 
$c$ is defined by 

$$W(c) = \sum_{i \neq j}^{d} L_i R_j = \sum_{i=1}^{d} L_i \sum_{i=1}^{d} R_i - \sum_{i=1}^{d} L_i R_i \qquad (1)$$

This algorithm is called the Maximal-Discernibility heuristic or the MD-heuristic. 
The high accuracy of decision trees constructed by using the MD-heuristic and their 
comparison with Entropy-based decision methods has been reported in [7]. 
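The conflict measure and the quality $W(c)$ of a cut can be computed directly from class distributions. The sketch below is an illustration with hypothetical function names; it takes the distributions $(L_1, \ldots, L_d)$ and $(R_1, \ldots, R_d)$ as plain sequences of counts.

```python
def conflict(dist):
    """Number of pairs of objects from different classes: sum over i < j of N_i * N_j."""
    total = sum(dist)
    return (total * total - sum(n * n for n in dist)) // 2

def quality(left, right):
    """W(c) = sum over i != j of L_i * R_j = L * R - sum over i of L_i * R_i."""
    return sum(left) * sum(right) - sum(l * r for l, r in zip(left, right))

left, right = (3, 1), (1, 2)
whole = [l + r for l, r in zip(left, right)]
# Both expressions for W(c) give the same value.
print(quality(left, right), conflict(whole) - conflict(left) - conflict(right))
```

For the distributions (3, 1) and (1, 2) both expressions evaluate to 7, which illustrates that the closed form agrees with the definition through conflict.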

For a given set of candidate cuts $C_a = \{c_1, \ldots, c_N\}$ on $a$, by the median of the 
$k$-th decision class (denoted by $Median(k)$) we mean the cut $c \in C_a$ minimizing the 
value $|L_k - R_k|$. Let $c_{min} = \min_i \{Median(i)\}$ and $c_{max} = \max_i \{Median(i)\}$. 
We have shown (in [8]) the technique for eliminating irrelevant cuts, called "Tail 
cuts can be eliminated", as follows. 

Theorem 1 The quality function $W : \{c_1, \ldots, c_N\} \to \mathbb{N}$ defined over the set of 
cuts is increasing in $\{c_1, \ldots, c_{min}\}$ and decreasing in $\{c_{max}, \ldots, c_N\}$. 




On Efficient Construction of Decision Trees from Large Databases 357 



This property is interesting because it implies that the best cut $c_{Best}$ can be 
found in the interval $\{c_{min}, \ldots, c_{max}\}$ using only $O(d \log N)$ queries to determine 
the medians of decision classes (by applying the Binary Search Algorithm) and to 
eliminate all tail cuts. Let us also observe that if all decision classes have similar 
medians then almost all cuts can be eliminated. 
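The median of one decision class can be located by binary search because $L_k$ never decreases as the cut moves to the right. The sketch below assumes a hypothetical callback `left_count(i)` that stands for one simple query returning $L_k$ for cut $c_i$; the function names and zero-based cut indices are illustrative only.

```python
def median_cut(left_count, class_size, n_cuts):
    """Index of the cut minimising |L_k - R_k| for one decision class.

    left_count(i) is assumed to be non-decreasing in i; R_k = class_size - L_k.
    """
    lo, hi = 0, n_cuts - 1
    while lo < hi:                           # first cut with 2 * L_k >= C_k
        mid = (lo + hi) // 2
        if 2 * left_count(mid) >= class_size:
            hi = mid
        else:
            lo = mid + 1
    best = lo
    if lo > 0:                               # the neighbor on the left may be closer
        def imbalance(i):
            return abs(2 * left_count(i) - class_size)
        if imbalance(lo - 1) <= imbalance(lo):
            best = lo - 1
    return best

def surviving_range(medians, n_cuts):
    """Range kept by Theorem 1 and the number of tail cuts eliminated."""
    c_min, c_max = min(medians), max(medians)
    return c_min, c_max, n_cuts - (c_max - c_min + 1)
```

Each class needs roughly log N calls to `left_count`, plus a constant number for the final comparison, which matches the order $O(d \log N)$ stated above.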



Example We consider a data table consisting of 12000 records. Objects are 
classified into 3 decision classes with the distribution (5000, 5600, 1400), 
respectively. One real value attribute has been selected and $N = 500$ cuts on its 
domain have generated class distributions as shown in Figure 1. 

The medians of the three decision classes are $c_{166}$, $c_{414}$ and $c_{189}$, respectively. 
The median of every decision class has been determined by the binary search 
algorithm using $\log N = 9$ simple queries. Applying Theorem 1 we conclude that 
it is enough to consider only cuts from $\{c_{166}, \ldots, c_{414}\}$. In this way 251 cuts have 
been eliminated by using 27 simple queries only. 

[Figure 1: three panels, "Distribution for first class", "Distribution for second class" and "Distribution for third class", each with Median(k) marked on a horizontal axis of cut indices.] 

Fig. 1. Distributions for decision classes 1, 2, 3. 



3 Divide and Conquer Strategy 

The main idea is to apply the "divide and conquer" strategy to determine the 
best cut $c_{Best} \in \{c_1, \ldots, c_N\}$ with respect to a given quality function. 

First we divide the set of possible cuts into $k$ intervals (e.g. $k = 2, 3, \ldots$). 
Then we choose the interval to which the best cut may belong with the highest 
probability. We will use some approximating measures to predict the interval 
which probably contains the best cut with respect to the discernibility measure. 
This process is repeated until the considered interval consists of one cut. Then 
the best cut can be chosen among all visited cuts. 

The problem arises of how to define the measure evaluating the quality of the 
interval $[c_L; c_R]$ having class distributions: $(L_1, \ldots, L_d)$ in $(-\infty; c_L)$; $(M_1, \ldots, M_d)$ 
in $[c_L; c_R)$; and $(R_1, \ldots, R_d)$ in $[c_R; \infty)$. This measure should estimate the quality 
of the best cut among those belonging to the interval $[c_L; c_R]$. In the next section 
we present some theoretical considerations about the quality of the best cut in 
$[c_L; c_R]$. These results will be used to construct the relevant measure to estimate 
the quality of the whole interval. We consider two specific probabilistic models 
for the distribution of objects in the interval $[c_L; c_R]$. 



Independency model: Let us consider an arbitrary cut $c$ lying between $c_L$ and 
$c_R$ and let us assume that $(x_1, x_2, \ldots, x_d)$ is the class distribution of the interval 
$[c_L; c]$. In this model we assume that $x_1, x_2, \ldots, x_d$ are independent random 
variables with uniform distribution over the sets $\{0, \ldots, M_1\}, \ldots, \{0, \ldots, M_d\}$, 
respectively. One can observe that for all $i \in \{1, \ldots, d\}$, $E(x_i) = \frac{M_i}{2}$ and 
$D^2(x_i) = \frac{M_i (M_i + 2)}{12}$. 

We have shown in [8] the following theorem: 



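Theorem 2 The mean $E(W(c))$ of quality $W(c)$ for any cut $c \in [c_L; c_R]$ satisfies 

$$E(W(c)) = \frac{W(c_L) + W(c_R) + conflict([c_L; c_R])}{2} \qquad (2)$$

and the variance of $W(c)$ is equal to 

$$D^2(W(c)) = \sum_{i=1}^{d} \left[ \frac{M_i (M_i + 2)}{12} \left( \sum_{j \neq i} (R_j - L_j) \right)^2 \right] \qquad (3)$$

where $conflict([c_L; c_R]) = \sum_{i<j} M_i M_j$. 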



Full dependent model: In this model, we assume that the values $x_1, \ldots, x_d$ 
are proportional to $M_1, \ldots, M_d$, i.e. 

$$\frac{x_1}{M_1} \simeq \frac{x_2}{M_2} \simeq \ldots \simeq \frac{x_d}{M_d}$$

In this model we have the following theorem: 

Theorem 3 In the full dependent model, the quality of the best cut in the interval 
$[c_L; c_R]$ is equal to 

$$W(c_{Best}) = \frac{W(c_L) + W(c_R) + conflict([c_L; c_R])}{2} + \frac{[W(c_R) - W(c_L)]^2}{8 \cdot conflict([c_L; c_R])} \qquad (4)$$

if $|W(c_R) - W(c_L)| < 2 \cdot conflict([c_L; c_R])$. Otherwise it is evaluated by 
$\max\{W(c_L), W(c_R)\}$. 



3.1 Evaluation measures 



These are two extreme cases of independent and "fully" dependent random 
variables of object distribution between decision classes. For real-life data one 
can expect that the variables are "partially" dependent. Hence we base our 
heuristic on the hypothesis that the quality of the best cut in $[c_L; c_R]$ can be 
estimated by the derived formula 

$$Eval([c_L; c_R]) = \frac{W(c_L) + W(c_R) + conflict([c_L; c_R])}{2} + \Delta \qquad (5)$$

where the value of $\Delta$ is defined by: 

$\Delta = \frac{[W(c_R) - W(c_L)]^2}{8 \cdot conflict([c_L; c_R])}$ (in the dependent model) 

$\Delta = \alpha \cdot \sqrt{D^2(W(c))}$ for some $\alpha \in [0; 1]$ (in the independent model) 

The choice of $\Delta$ and the value of parameter $\alpha$ from $[0; 1]$ can be tuned in the 
learning process or are given by an expert. 
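Formula (5) can be evaluated from the three class distributions of an interval. The sketch below is a hedged illustration: the function names are hypothetical, the variance used for the independent model is assumed to be $\sum_i \frac{M_i(M_i+2)}{12} \left(\sum_{j \neq i} (R_j - L_j)\right)^2$, and $\Delta$ is set to 0 when the interval contains objects of a single class (zero conflict), a case the formula leaves undefined.

```python
import math

def conflict(dist):
    total = sum(dist)
    return (total * total - sum(n * n for n in dist)) // 2

def quality(left, right):
    return sum(left) * sum(right) - sum(l * r for l, r in zip(left, right))

def eval_interval(left, middle, right, alpha=None):
    """Formula (5); left, middle, right are the distributions in
    (-inf; c_L), [c_L; c_R) and [c_R; inf).
    alpha=None selects the dependent-model Delta, a number the independent one."""
    w_left = quality(left, [m + r for m, r in zip(middle, right)])     # W(c_L)
    w_right = quality([l + m for l, m in zip(left, middle)], right)    # W(c_R)
    conf = conflict(middle)
    base = (w_left + w_right + conf) / 2
    if alpha is None:
        delta = (w_right - w_left) ** 2 / (8 * conf) if conf else 0.0
    else:
        total_diff = sum(r - l for l, r in zip(left, right))
        variance = sum(m * (m + 2) / 12 * (total_diff - (r - l)) ** 2
                       for l, m, r in zip(left, middle, right))
        delta = alpha * math.sqrt(variance)
    return base + delta

print(eval_interval((2, 0), (1, 1), (0, 2)))              # dependent model
print(eval_interval((2, 0), (1, 1), (0, 2), alpha=1.0))   # independent model
```

In this small case $W(c_L) = W(c_R) = 6$ and the conflict of the interval is 1, so the dependent-model estimate is 6.5.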

3.2 Local and Global Search 

We present two strategies of searching for the best cut using Formula (5), called 
local and global search. In the local search algorithm, we first discover the best 
cuts on every attribute separately. Next, we compare all locally best cuts to find 
the globally best one. The details of the local algorithm can be described as follows: 

Algorithm: Searching for semi-optimal cut 
Parameters: $k \in \mathbb{N}$ and $\alpha \in [0; 1]$. 
Input: attribute $a$; the set of candidate cuts $C_a = \{c_1, \ldots, c_N\}$ on $a$; 
Output: The optimal cut $c \in C_a$ 

begin 
    Left $\leftarrow$ min; Right $\leftarrow$ max; {see Theorem 1} 
    while (Left < Right) 
        1. Divide [Left; Right] into $k$ intervals with equal length by $(k + 1)$ 
           boundary points, i.e. 
               $p_i = Left + i \cdot \frac{Right - Left}{k}$, for $i = 0, \ldots, k$. 
        2. For $i = 1, \ldots, k$ compute $Eval([c_{p_{i-1}}; c_{p_i}], \alpha)$ using Formula (5). Let 
           $[p_{j-1}; p_j]$ be the interval with maximal value of $Eval(.)$; 
        3. Left $\leftarrow p_{j-1}$; Right $\leftarrow p_j$; 
    endwhile; 
    Return the cut $c_{Left}$; 
end 

One can see that to determine the value $Eval([c_L; c_R])$ we need only $O(d)$ simple 
SQL queries of the form: SELECT COUNT FROM . . . WHERE attribute BETWEEN 
$c_L$ AND $c_R$. Hence the number of queries necessary for running our algorithm 
is of order $O(dk \log_k N)$. In practice we set $k = 3$ because the function $f(k) = 
dk \log_k N$ over positive integers takes its minimum for $k = 3$. For $k > 2$, instead of 
choosing the best interval $[p_{i-1}; p_i]$, one can select the best union $[p_{i-m}; p_i]$ of 
$m$ consecutive intervals in every step for a predefined parameter $m < k$. The 
modified algorithm needs more, but still $O(\log N)$, simple queries only. 
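The local search loop can be sketched in Python as follows. Everything here is an assumption made for illustration: cuts are addressed by integer index, `prefix[i]` holds the class distribution to the left of cut `i` (in a database each lookup would be one COUNT query per class), `prefix[-1]` holds the distribution of all objects, and only the dependent-model $\Delta$ is used. The loop stops when the interval cannot be split further and returns the best cut among all visited ones.

```python
def conflict(dist):
    total = sum(dist)
    return (total * total - sum(n * n for n in dist)) // 2

def quality(left, right):
    return sum(left) * sum(right) - sum(l * r for l, r in zip(left, right))

def local_search(prefix, lo, hi, k=3):
    """Divide-and-conquer search for a good cut index in [lo, hi]."""
    total = prefix[-1]

    def w(i):                                   # W(c_i)
        return quality(prefix[i], [t - p for t, p in zip(total, prefix[i])])

    def evaluate(i, j):                         # Formula (5), dependent-model Delta
        conf = conflict([q - p for p, q in zip(prefix[i], prefix[j])])
        delta = (w(j) - w(i)) ** 2 / (8 * conf) if conf else 0.0
        return (w(i) + w(j) + conf) / 2 + delta

    visited = {lo, hi}
    while hi - lo > 1:
        # k + 1 boundary points, rounded down to cut indices and de-duplicated.
        points = sorted({lo + i * (hi - lo) // k for i in range(k + 1)})
        visited.update(points)
        lo, hi = max(zip(points, points[1:]), key=lambda iv: evaluate(*iv))
    return max(visited, key=w)

# Two classes of 50 objects each, separated exactly at cut index 50.
prefix = [(min(i, 50), max(0, i - 50)) for i in range(101)]
print(local_search(prefix, 0, 100))
```

Each round inspects k intervals, so the number of `evaluate` calls grows with k times the logarithm of the number of cuts, in line with the query count given above.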

The global strategy is searching for the best cut over all attributes. At the 
beginning, the best cut can belong to every attribute, hence for each attribute 
we keep the interval in which the best cut can be found (see Theorem 1), i.e. we 
have a collection of all potential intervals 

$$Interval\_Lists = \{(a_1, l_1, r_1), (a_2, l_2, r_2), \ldots, (a_k, l_k, r_k)\}$$

Next we iteratively run the following procedure 

- remove the interval $I = (a, c_L, c_R)$ having the highest probability of containing 
the best cut (using Formula 5); 

- divide interval $I$ into smaller ones $I = I_1 \cup I_2 \cup \ldots \cup I_k$; 

- insert $I_1, I_2, \ldots, I_k$ to $Interval\_Lists$. 

This iterative step can be continued until we have a one-element interval or the 
time limit of the searching algorithm is exhausted. This strategy can be simply 
implemented using a priority queue to store the set of all intervals, where the 
priority of intervals is defined by Formula 5. 
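A priority-queue version of the global strategy might look like the sketch below. The names are hypothetical: `score(attribute, lo, hi)` stands for whatever implementation of Formula 5 is available, intervals are given by integer cut indices, and `max_steps` plays the role of the time limit.

```python
import heapq

def global_search(intervals, score, k=3, max_steps=1000):
    """intervals: iterable of (attribute, lo, hi); returns the most promising interval."""
    # heapq is a min-heap, so scores are negated to pop the best interval first.
    heap = [(-score(a, lo, hi), a, lo, hi) for a, lo, hi in intervals]
    heapq.heapify(heap)
    for _ in range(max_steps):
        _, a, lo, hi = heapq.heappop(heap)
        if hi - lo <= 1:
            return a, lo, hi                    # cannot be divided any further
        points = sorted({lo + i * (hi - lo) // k for i in range(k + 1)})
        for p, q in zip(points, points[1:]):
            heapq.heappush(heap, (-score(a, p, q), a, p, q))
    _, a, lo, hi = heap[0]                      # time limit reached
    return a, lo, hi

# Toy score: only intervals of attribute "b" that contain cut 37 look promising.
def toy_score(attribute, lo, hi):
    return 10 if attribute == "b" and lo <= 37 <= hi else 1

print(global_search([("a", 0, 100), ("b", 0, 100)], toy_score))
```

Unlike the local strategy, intervals of different attributes compete in one queue, so effort is spent only on the attribute that currently looks best.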



3.3 Example 



In Figure 2 we show the graph of $W(c_i)$ for $i \in \{166, \ldots, 414\}$ and illustrate 
the outcome of applying our algorithm to the reduced set of cuts for $k = 2$ and 
$\Delta = 0$. 

First the cut $c_{290}$ is chosen and it is necessary to determine to which of 
the intervals $[c_{166}, c_{290}]$ and $[c_{290}, c_{414}]$ the best cut belongs. The values of the 
function Eval on these intervals are computed: 

$Eval([c_{166}; c_{290}]) = 23927102$, $Eval([c_{290}; c_{414}]) = 24374685$. 

Fig. 2. Graph of $W(c_i)$ for $i \in \{166, \ldots, 414\}$. 

Hence, the best cut is predicted to belong to $[c_{290}, c_{414}]$ and the search process is 
reduced to the interval $[c_{290}, c_{414}]$. The above procedure is repeated recursively 
until the selected interval consists of a single cut only. For our example, the best cut 
$c_{296}$ has been successfully selected by our algorithm. In general, the cut selected 
by the algorithm is not necessarily the best. However, numerous experiments on 
different large data sets have shown that the cut $c^*$ returned by the algorithm is 
close to the best cut $c_{Best}$ (i.e. $\frac{W(c^*)}{W(c_{Best})} \cdot 100\%$ is about 99.9%). 
4 Conclusions 

The problem of optimal binary partition of a continuous attribute domain for large 
data sets stored in relational databases has been investigated. We reduced the 
number of simple queries from $O(N)$ to $O(\log N)$ to construct a partition very 
close to the optimal one. We plan to extend these results to other measures. 
Acknowledgement: This paper was supported by KBN grant 8T11C02519 and 
Wallenberg Foundation - WITAS project. 
Abstract. This paper introduces a new approach to induced rules for 
quantitative evaluation, which can be viewed as a statistical extension 
of rough set methods. For this extension, chi-square distribution and 
F-distribution play an important role in statistical evaluation. 



1 Introduction 

Rough set based rule induction methods have been applied to knowledge 
discovery in databases [1, 2, 6, 7]. The empirical results obtained show that they 
are very powerful and that some important knowledge has been extracted from 
datasets. However, quantitative evaluation of induced rules is based not on 
statistical evidence but on rather naive indices, such as conditional probabilities 
and functions of conditional probabilities. 

In this paper, we introduce a new approach to induced rules for quantitative 
evaluation, which can be viewed as a statistical extension of rough set methods. 
For this extension, chi-square distribution and F-distribution play an important 
role in statistical evaluation. 

The paper is organized as follows: Section 2 discusses the characteristics of 
contingency tables. Section 3 shows the definitions of statistical measures used 
for contingency tables and their assumptions. Section 4 presents an approach 
to statistical evaluation of a rough set model and an illustrative example. Finally, 
Section 5 concludes this paper. 

2 From Information Systems to Contingency Tables 

2.1 Accuracy and Coverage 

In the subsequent sections, we adopt the following notation, which is introduced 
in [5]. Let $U$ denote a nonempty, finite set called the universe and $A$ denote 
a nonempty, finite set of attributes, i.e., $a : U \to V_a$ for $a \in A$, where $V_a$ 
is called the domain of $a$. Then, a decision table is defined as an 
information system, $\mathcal{A} = (U, A \cup \{d\})$. The atomic formulas over $B \subseteq A \cup \{d\}$ 
and $V$ are expressions of the form $[a = v]$, called descriptors over $B$, where $a \in B$ 
and $v \in V_a$. The set $F(B, V)$ of formulas over $B$ is the least set containing all 
atomic formulas over $B$ and closed with respect to disjunction, conjunction and 
negation. For each $f \in F(B, V)$, $f_A$ denotes the meaning of $f$ in $A$, i.e., the set 
of all objects in $U$ with property $f$, defined inductively as follows. 

1. If $f$ is of the form $[a = v]$ then $f_A = \{s \in U \mid a(s) = v\}$ 

2. $(f \wedge g)_A = f_A \cap g_A$; $(f \vee g)_A = f_A \cup g_A$; $(\neg f)_A = U - f_A$ 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 362-369, 2001. 

By the use of this framework, classification accuracy and coverage, or true 
positive rate, are defined as follows. 

Definition 1. 

Let $R$ and $D$ denote a formula in $F(B, V)$ and a set of objects which belong to 
a decision $d$. Classification accuracy and coverage (true positive rate) for $R \to d$ 
are defined as: 

$$\alpha_R(D) = \frac{|R_A \cap D|}{|R_A|} \; (= P(D|R)), \quad \text{and} \quad \kappa_R(D) = \frac{|R_A \cap D|}{|D|} \; (= P(R|D)),$$

where $|A|$ denotes the cardinality of a set $A$, $\alpha_R(D)$ denotes the classification 
accuracy of $R$ as to classification of $D$, and $\kappa_R(D)$ denotes the coverage, or true 
positive rate, of $R$ to $D$, respectively. 

It is notable that these two measures are equal to conditional probabilities: 
accuracy is the probability of $D$ under the condition of $R$, and coverage is that of 
$R$ under the condition of $D$. 



2.2 Contingency Tables 

From the viewpoint of information systems, a contingency table summarizes the
relation between attributes with respect to frequencies. This viewpoint has
already been discussed in [8, 9]. However, in this study, we focus on a more
statistical interpretation of this table. Let $R_1$ and $R_2$ denote formulas
in $F(B, V)$. A contingency table is a table of the following cardinalities:
$|[R_1 = 0]_A|$, $|[R_1 = 1]_A|$, $|[R_2 = 0]_A|$, $|[R_2 = 1]_A|$,
$|[R_1 = 0 \wedge R_2 = 0]_A|$, $|[R_1 = 0 \wedge R_2 = 1]_A|$,
$|[R_1 = 1 \wedge R_2 = 0]_A|$, $|[R_1 = 1 \wedge R_2 = 1]_A|$, and
$|[R_1 = 0 \vee R_1 = 1]_A|$ $(= |U|)$. This table is arranged into the form
shown in Table 1.

Table 1. Two way Contingency Table

            R1 = 0    R1 = 1
R2 = 0      a         b         a + b
R2 = 1      c         d         c + d
            a + c     b + d     a + b + c + d (= |U| = N)

From this table, accuracy and coverage for $[R_1 = 0] \to [R_2 = 0]$ are
defined as:

$$\alpha_{[R_1=0]}([R_2=0]) = \frac{|[R_1=0 \wedge R_2=0]_A|}{|[R_1=0]_A|} = \frac{a}{a+c}, \quad \text{and}$$

$$\kappa_{[R_1=0]}([R_2=0]) = \frac{|[R_1=0 \wedge R_2=0]_A|}{|[R_2=0]_A|} = \frac{a}{a+b}.$$

Table 2. A Small Dataset

a   b   c   d   e
0   0   0   0   1
1   0   1   1   1
0   1   1   1   0
1   1   1   1   1
0   0   1   0   0

Table 3. Corresponding Contingency Table

        b=0   b=1
e=0     1     1     2
e=1     2     1     3
        3     2     5



For example, let us consider an information table shown in Table 2. When we
examine the relationship between b and e via a contingency table, first we count
the frequencies of four elementary relations, called marginal distributions:
$[b = 0]$, $[b = 1]$, $[e = 0]$, and $[e = 1]$. Then, we count the frequencies
of four kinds of conjunction: $[b = 0] \wedge [e = 0]$, $[b = 0] \wedge [e = 1]$,
$[b = 1] \wedge [e = 0]$, and $[b = 1] \wedge [e = 1]$. Then, we obtain the
contingency table shown in Table 3. From this table, accuracy and coverage for
$[b = 0] \to [e = 0]$ are obtained as 1/(1 + 2) = 1/3 and 1/(1 + 1) = 1/2.
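
The counting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are copied from Table 2, while the helper names `contingency` and `accuracy_and_coverage` are hypothetical and not part of the paper.

```python
from collections import Counter

# Rows of Table 2, one tuple per object, in attribute order (a, b, c, d, e).
ATTRS = ("a", "b", "c", "d", "e")
TABLE_2 = [
    (0, 0, 0, 0, 1),
    (1, 0, 1, 1, 1),
    (0, 1, 1, 1, 0),
    (1, 1, 1, 1, 1),
    (0, 0, 1, 0, 0),
]


def contingency(rows, r1, r2):
    """Count the four cells (a, b, c, d) of Table 1 for binary attributes r1, r2."""
    i, j = ATTRS.index(r1), ATTRS.index(r2)
    counts = Counter((row[i], row[j]) for row in rows)
    a = counts[(0, 0)]  # R1 = 0 and R2 = 0
    b = counts[(1, 0)]  # R1 = 1 and R2 = 0
    c = counts[(0, 1)]  # R1 = 0 and R2 = 1
    d = counts[(1, 1)]  # R1 = 1 and R2 = 1
    return a, b, c, d


def accuracy_and_coverage(a, b, c, d):
    """Accuracy a/(a+c) and coverage a/(a+b) of the rule [R1 = 0] -> [R2 = 0]."""
    return a / (a + c), a / (a + b)


cells = contingency(TABLE_2, "b", "e")
print(cells)                          # cell counts (a, b, c, d) of Table 3
print(accuracy_and_coverage(*cells))  # (accuracy, coverage) for [b = 0] -> [e = 0]
```

The sketch assumes both marginal counts in the denominators are nonzero; a production version would need to guard against empty rows or columns.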

3 Chi-square Test 

The chi-square test is based on the following theorem[4]. 

Theorem 1. When a contingency table shown in Table 4 is given, the test
statistic

$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(x_{ij} - a_i b_j / N)^2}{a_i b_j / N} \qquad (1)$$

follows the chi-square distribution with $(n - 1)(m - 1)$ degrees of freedom.

Table 4. Contingency Table

        A1     A2     ...    An     Sum
B1      x11    x21    ...    xn1    b1
B2      x12    x22    ...    xn2    b2
...
Bm      x1m    x2m    ...    xnm    bm
Sum     a1     a2     ...    an     N

In the case of binary attributes shown in Table 1, this test statistic can be
transformed into the following simple formula, and it follows the chi-square
distribution with one degree of freedom:

$$\chi^2 = \frac{N (x_{11} x_{22} - x_{12} x_{21})^2}{a_1 a_2 b_1 b_2} = \frac{N (ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)} \qquad (2)$$
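
As a quick check that the general statistic and the binary shortcut agree, the following minimal sketch computes both for the counts of Table 3. The function names are hypothetical, and the code assumes every marginal sum is nonzero.

```python
def chi_square(table):
    """Pearson chi-square statistic of a table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for j, row in enumerate(table):
        for i, observed in enumerate(row):
            expected = col_sums[i] * row_sums[j] / total  # a_i * b_j / N
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_2x2(a, b, c, d):
    """Shortcut for a 2x2 table with rows (a, b) and (c, d), as in Table 1."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


table_3 = [[1, 1], [2, 1]]  # rows e=0, e=1; columns b=0, b=1
print(chi_square(table_3))          # general formula
print(chi_square_2x2(1, 1, 2, 1))   # binary shortcut, same value up to rounding
```

Both calls evaluate to 5/36 for these counts, which is the value used in the worked example of Section 4.3.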

One of the core ideas of the chi-square test is that the test statistic measures
the square of the difference between the observed value and the expected value
of each cell. In the example shown in Table 4, $(x_{11} - a_1 b_1/N)^2$ measures
the difference between $x_{11}$ and the expected value of this cell, $a_1 b_1/N$,
where $b_1/N$ is the marginal distribution of $B_1$.

Another core idea is that $a_1 b_1/N$ is equivalent to the variance of marginal
distributions if they follow multinomial distributions.^1 Thus, the chi-square
test statistic is equivalent to the total sum of the ratio of the squared
distance to the corresponding variance. Actually, the theorem above follows as
a corollary of a more general theorem if a given multinomial distribution
converges to a normal distribution.

Theorem 2. If $x_1, x_2, \ldots, x_n$ are randomly selected from a population
following a normal distribution $N(m, \sigma^2)$, the formula

$$y = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x})^2,$$

where $\bar{x}$ is the sample mean, follows the $\chi^2$ distribution with
$(n - 1)$ degrees of freedom.

In the subsequent sections, we adopt all the assumptions discussed above.

4 Towards Statistical Extension of Rough Sets 

4.1 Rough Set Approximations and Contingency Tables 

The important idea in rough sets is that real-world concepts can be captured by
two approximations: lower and upper approximations [3]. Although these ideas
are deterministic, they can be extended into naive probabilistic models if we set
up precision as shown in Ziarko’s variable precision rough set model (VPRS) [10].

^1 If the probabilities p and q come from the multinomial distribution, Npq is
equal to the variance.

Table 5. Contingency Table for Lower Approximation

            R1 = 0    R1 = 1
R2 = 0      a         b         a + b
R2 = 1      0         d         d
            a         b + d     a + b + d

Table 6. Contingency Table for Upper Approximation

            R1 = 0    R1 = 1
R2 = 0      a         0         a
R2 = 1      c         d         c + d
            a + c     d         a + c + d



From the ideas of Ziarko’s VPRS, Tsumoto shows that lower and upper
approximations of a target concept correspond to sets of examples which satisfy
the following conditions [7]:

Lower Approximation of $D$: $\cup \{R_A \mid \alpha_R(D) = 1.0\}$,

Upper Approximation of $D$: $\cup \{R_A \mid \kappa_R(D) = 1.0\}$,

where $R$ is a disjunctive or conjunctive formula. Thus, if we assume that all
the attributes are binary, we can construct contingency tables corresponding to
these two approximations as shown in the following subsubsections.



Lower approximation. From the definition of accuracy shown in Section 2,
the contingency table for lower approximation is obtained if $c$ is set to 0.
That is, Table 5 corresponds to the lower approximation of $R_2 = 0$. In this
case, the test statistic is simplified into:

$$\chi^2 = \frac{N (ad)^2}{a (b + d)(a + b) d} \qquad (3)$$



Upper approximation. From the definition of coverage shown in Section 2,
the contingency table for upper approximation is obtained if $b$ is set to 0.
That is, Table 6 corresponds to the upper approximation of $R_2 = 0$. In this
case, the test statistic is simplified into:

$$\chi^2 = \frac{N (ad)^2}{(a + c) d \, a (c + d)} \qquad (4)$$

4.2 Measuring Distance from Two Approximations

As discussed in Section 3, the core idea of the $\chi^2$-test is to measure the
distance from the ideal marginal distributions. In the above subsections,
equations 3 and 4 measure the distance between lower and upper approximations
and marginal distributions, respectively. In statistical analysis, if all
columns (or rows) match the marginal distributions, then we can say that the
variable $R_1$ has no correlation with the other one, $R_2$. For the evaluation
of this assertion, the $\chi^2$-test statistic is used. From this statistic, we
obtain the corresponding p-value, which measures the probability of a statistic
at least this large when no correlation between $R_1$ and $R_2$ is assumed.

Table 7. Two Tables for $\chi^2$-test

            R1 = 0    R1 = 1
R2 = 0      x11       x12       b0
R2 = 1      x21       x22       b1
            a0        a1        N

            R1 = 0    R1 = 1
R2 = 0      a0b0/N    a1b0/N    b0
R2 = 1      a0b1/N    a1b1/N    b1
            a0        a1        N

On the other hand, from the viewpoint of information tables, the test statistic
can be viewed as the distance between the existing table and the table with no
correlation (Table 7). Thus, intuitively we can conclude that the $\chi^2$-test
measures a similarity (quasi-distance) between two tables.
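
The "table with no correlation" can be built directly from the marginal sums. The sketch below is a hypothetical helper, not taken from the paper; it assumes the table is given as a list of rows with a nonzero grand total and is applied here to the counts of Table 3.

```python
def independence_table(table):
    """Expected counts (column sum * row sum / N) for a table given as rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [[col * row / total for col in col_sums] for row in row_sums]


observed = [[1, 1], [2, 1]]  # rows e=0, e=1; columns b=0, b=1 (Table 3)
expected = independence_table(observed)
for obs_row, exp_row in zip(observed, expected):
    # Print each observed row next to its no-correlation counterpart.
    print(obs_row, exp_row)
```

For these counts the expected cells come out as 1.2, 0.8, 1.8 and 1.2, so the table keeps the same marginal sums as the observed one while removing any association between the two attributes.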

From the statistical point of view, this $\chi^2$ test statistic represents
statistical information about information tables. Thus, if comparison between
these test statistics is allowed, this test statistic can be used to measure a
similarity between two tables. For this purpose, we can use the following
theorem [4].

Theorem 3. Let $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$ be randomly
selected from two populations following a normal distribution $N(x, \sigma^2)$
and a normal distribution $N(y, \sigma^2)$ (with the same value of variances),
respectively. Then the test statistic

$$F = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)}{\sum_{j=1}^{m} (y_j - \bar{y})^2 / (m - 1)}$$

follows the F-distribution with $(n - 1, m - 1)$ degrees of freedom.

The important assumption of this theorem is that the variances of the two
samples are equal. In the case of two-way contingency tables, this assumption
translates into the requirement that the four marginal distributions of one
table are equal to those of the other table. Therefore,

Corollary 1. If the four marginal distributions of an information table $T_1$
are equal to those of the other table $T_2$, then the test statistic

$$F(T_1, T_2) = \frac{\chi^2(T_1)}{\chi^2(T_2)}$$

follows the F-distribution with $(m - 1, n - 1)$ degrees of freedom, where
$m - 1$ and $n - 1$ are the degrees of freedom of $\chi^2(T_1)$ and
$\chi^2(T_2)$.

In this way, if we assume that the marginal distribution of an information
table is the same as that of another table, we can compare these two tables.
Thus, if we prepare the sample tables for lower and upper approximation, we can
discuss whether the contingency table of an information system is statistically
different from that of a sample table of lower or upper approximation.



4.3 Example 



Let us consider an example shown in Table 3.^2 The test statistic for this table
is equal to:

$$\chi^2 = \frac{5 \times (1 - 2)^2}{3 \times 2 \times 2 \times 3} = \frac{5}{36} = 0.14 \qquad (5)$$

The p-value of this test statistic is equal to 0.709. Thus, the probability that
these two attributes have no correlation is equal to 0.709. In particular, this
test statistic measures a similarity between Table 3 and Table 8.

Table 8. Compared Contingency Table

        b=0    b=1
e=0     1.2    0.8    2
e=1     1.8    1.2    3
        3      2      5

From the above tables, let us calculate the similarity between Table 3 and a
table for upper approximation shown in Table 9.

Table 9. Table Corresponding to Upper Approximation

        b=0    b=1
e=0     2      0      2
e=1     1      2      3
        3      2      5

The test statistic for Table 9 is equal to:

$$\chi^2 = \frac{5 \times (4 - 0)^2}{3 \times 2 \times 2 \times 3} = \frac{80}{36} = 2.22 \qquad (6)$$

and the p-value is 0.136. From equations 5 and 6, a similarity between these two
tables is:

$$F = \frac{2.22}{0.14} \approx 16,$$

which can be tested by the F-distribution with $(1, 1)$ degrees of freedom.
Since the p-value of $F$ is equal to 0.156, the probability that these two
tables are different from each other is equal to 0.156.

^2 Please note that this example is for illustration. For real-world statistical
analysis, the values for each cell and the sample size should be much larger
because all the statistical theorems describe asymptotic characteristics.
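
The numbers in this example can be reproduced with the Python standard library alone. The sketch below is an illustration, not the authors' code. It relies on two closed forms that hold only for these specific degrees of freedom: the upper tail of a chi-square variable with one degree of freedom is `erfc(sqrt(x / 2))`, and the upper tail of an F variable with (1, 1) degrees of freedom is `1 - (2 / pi) * atan(sqrt(f))`. All function names are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic of a 2x2 table with rows (a, b) and (c, d)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


def chi2_upper_tail_df1(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def f_upper_tail_1_1(f):
    """P(F > f) for an F variable with (1, 1) degrees of freedom."""
    return 1.0 - (2.0 / math.pi) * math.atan(math.sqrt(f))


chi_3 = chi_square_2x2(1, 1, 2, 1)   # Table 3
chi_9 = chi_square_2x2(2, 0, 1, 2)   # Table 9 (upper approximation)
p_3 = chi2_upper_tail_df1(chi_3)
p_9 = chi2_upper_tail_df1(chi_9)
f_ratio = chi_9 / chi_3              # ratio of the two unrounded statistics
p_f = f_upper_tail_1_1(f_ratio)

print(round(chi_3, 2), round(p_3, 3))
print(round(chi_9, 2), round(p_9, 3))
print(round(f_ratio, 2), round(p_f, 3))
```

Using the unrounded statistics, the ratio is exactly 80/5 = 16, which explains why the rounded quotient 2.22/0.14 is reported as 16 in the text.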

Thus, if we set up the precision (critical value) to be 0.05 as in statistical
analysis, the null hypothesis that these two tables are similar to each other will be
accepted. In other words, from the viewpoint of conservative (or strict) criteria,
the conclusions in Table 3 are similar to those in Table 9 (upper approximation).
Thus, it is weakly concluded that the meaning of $[e = 0]$ can be regarded as
an upper approximation of $[b = 0]$.

5 Conclusion 

In this paper, we introduce a new approach to the quantitative evaluation of
induced rules, which can be viewed as a statistical extension of rough set
methods. For this extension, the chi-square distribution and the F-distribution
play an important role in statistical evaluation. The chi-square test statistic
measures statistical information about an information table, and the F-test
statistic is used to measure the difference between two tables. This paper is a
preliminary study on the statistical evaluation of information tables, and the
discussions are very intuitive, not mathematically rigorous. Also, for
simplicity of discussion, we assume that all conditional attributes and decision
attributes in information tables are binary. More formal analysis will appear in
future work.

Abstract. As the amount of information in the world is steadily increasing,
there is a growing demand for tools for analyzing the information. The problem
of data mining is investigated in this paper. It is very important and useful to
generate decision rules and reason under inconsistency. Propositional default
rules are generated in this paper. Based on analysis of inconsistency, Skowron’s
default rule generation algorithm is improved. A corresponding reasoning
method with a rule-choosing stratagem of lower frequency first under
inconsistency is also developed. A suitable decision can be generated for any
yet unseen object including one with unknown attribute values and one that is
even inconsistent (conflicting) with objects of the training decision table. The
rule-choosing stratagem is shown to be valid by our experiments.



1 Introduction 

As the amount of information in the world is steadily increasing, there is a growing 
demand for tools for analyzing the information, finding patterns in terms of implicit 
dependencies in data. Realizing that much of the collected data will not be handled or 
even seen by human beings, data mining technology will be of increasing importance 
in the future. Although simple statistical techniques for data analysis were developed 
long ago, advanced techniques for intelligent data analysis are not yet mature. As a 
result, there is a growing gap between data generation and data understanding. 

Rough sets have been introduced as a tool to process inexact, uncertain or vague
knowledge in AI, for example in knowledge-based systems in medicine, natural
language processing, decision systems, and approximate reasoning [1], [2], [3], [4].
Some rough set based methods and algorithms have been developed in recent years
to generate rules from a decision table without any conflicting objects [5], [6],
[7], [8]. In these cases, definite rules may be generated. Unfortunately, there
are many inconsistencies, or uncertainties, in real life. We also need to be able
to reason under inconsistency. Different experts may disagree on the
classification of one particular object, in which case it is desirable to assign
different trust to the respective conclusions. Also, if objects are classified
inconsistently, we still want to generate rules that can reflect the normal
situation. The inconsistency problem may be caused by factors such as
insufficiency of condition attributes, errors in the measuring process and
mistakes in the recording process. Some researchers have worked in this field,
trying to generate default rules from an inconsistent decision table [9], [10].

Skowron proposed a rough set framework that is able to generate and reason about
classes for which no unique decision can be made [10]. His basic idea is to create
indeterminacy in information systems, and generate rules covering the majority of the
cases. Skowron considered the generation of indeterminacy through selecting
projections over the condition attributes, allowing certain attributes to be excluded
from consideration. The rules that result generally have at least two advantages as
compared to deterministic rules: they are always simpler in structure, and though not
entirely correct, they will in many cases prove to be better when handling yet unseen
cases, being less susceptible to noise. Unfortunately, Skowron did not study the
inconsistency problem thoroughly, so there are still some conflicts in his method. If a
yet unseen case that is inconsistent with a case in the training decision table occurs,
conflicting results will be generated by different default rules extracted from the
training decision table. In this case, the rules cannot work, and no decision can be made.

We will examine both explicit inconsistency and implicit inconsistency in an
information system thoroughly in this paper. A new default rule extracting algorithm
and its corresponding reasoning algorithm will be developed based on Skowron’s
default rule extracting algorithm. A new rule-choosing stratagem under inconsistency
will be presented. In section 2, we will briefly introduce Skowron’s default rule
extracting algorithm and analyze its shortcomings. In section 3, we will examine the
explicit and implicit inconsistency in a decision table thoroughly. An improved
default rule extracting algorithm and its corresponding reasoning algorithm with a
new rule-choosing stratagem of lower frequency first will be developed in section 4.
In section 5, we will illustrate the validity of our methods through some experiments.
Finally, we will conclude our work in section 6.



2 Skowron’s Default Rule Extracting Algorithm 

Skowron developed an algorithm to extract default rules from a decision table even if 
it contains some inconsistent cases [10]. 

Algorithm 1: Skowron’s default rule extracting algorithm.

Input: A training decision table $A^* = (U, A^*)$, where $A^* = (C^*, \{D\})$,
$U$ is a finite and nonempty set called the universe that contains all cases
(samples, or objects), $A^*$ is a finite, nonempty set of attributes, $C^*$ is a
finite, nonempty set of condition attributes, and $D$ is the decision attribute.

Output: Default rules.

Step 1: Calculate the indiscernibility relation $U/IND(C^*)$. If a class
$E_{(k,C^*)}$ ($E_{(k,C^*)} \in U/IND(C^*)$, $k = 1, \ldots, |U/IND(C^*)|$) of
cases in $U/IND(C^*)$ can be classified into a decision class $X_j$ with a
membership that is greater than some threshold, then a corresponding default
rule can be extracted. That is, if

$$\mu_{C^*}(E_{(k,C^*)}, X_j) = \frac{|E_{(k,C^*)} \cap X_j|}{|E_{(k,C^*)}|} > \mu_{tr},$$

then the following default rule can be generated:

$$R: \; Des(E_{(k,C^*)}, C^*) \to Des(X_j, D) \mid \mu_{C^*}(E_{(k,C^*)}, X_j),$$

where $\mu_{C^*}(E_{(k,C^*)}, X_j)$ is the certainty factor (CF) of rule
$Des(E_{(k,C^*)}, C^*) \to Des(X_j, D)$.

Step 2: Put the decision table $A^*$ into a decision table set $\psi$, that is,
$\psi = \{A^*\}$.

Step 3: If $\psi$ is empty ($\psi = \emptyset$), then stop; else, take a decision
table $A = (U, A)$ out of $\psi$ ($\psi = \psi - \{A\}$) and calculate its core
attribute set $Core(C)$, where $A = (C, \{D\})$. We can select projections
$C_{Pr} = C - C_{cut}$ ($r = 1, \ldots, Card(Core(C))$) over the attributes. The
projections are selected such that new indeterminacy results. The following 5
sub-steps should be done to these projections.

1. If $C_{Pr} = \emptyset$, then do nothing to the projection; else, do the
following 4 steps to it.

2. Insert the projection decision table $A' = (U, A')$ into $\psi$, that is,
$\psi = \psi \cup \{A'\}$, where $A' = (C_{Pr}, \{D\})$.

3. Calculate the indiscernibility relation $U/IND(C_{Pr})$.

4. If a class $E_{(k,C_{Pr})}$ ($E_{(k,C_{Pr})} \in U/IND(C_{Pr})$,
$k = 1, \ldots, |U/IND(C_{Pr})|$) of cases in $U/IND(C_{Pr})$ can be classified
into a decision class $X_j$ with a membership that is greater than some
threshold, then a corresponding default rule can be extracted. That is, if

$$\mu_{C_{Pr}}(E_{(k,C_{Pr})}, X_j) = \frac{|E_{(k,C_{Pr})} \cap X_j|}{|E_{(k,C_{Pr})}|} > \mu_{tr},$$

then the following default rule can be generated:

$$R': \; Des(E_{(k,C_{Pr})}, C_{Pr}) \to Des(X_j, D) \mid \frac{|E_{(k,C_{Pr})} \cap X_j|}{|E_{(k,C_{Pr})}|}.$$

5. Facts are constructed that may potentially block the application of a default
rule. If there is an $E_i$ with $E_i \in U/IND(C) \wedge E_i \subseteq E_{(k,C_{Pr})}$,
then the following fact can be generated:

$$F': \; Des(E_i, C_{cut}) \to \neg(R'),$$

where $\neg$ is a logical NOT operator.

Step 4: Go to step 3.
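
The membership test used in Step 1 and in sub-step 4 can be sketched as follows. This is a partial illustration only: it covers neither the core computation nor the blocking facts. The table `CLASSES` is a hypothetical stand-in for the example decision table, chosen so that it reproduces the certainty factors quoted for the example rules, and the function name `default_rules` is likewise an assumption.

```python
from collections import defaultdict

# Hypothetical decision table: (a, b, c, d, number of objects), 100 objects in total.
CLASSES = [
    (1, 2, 3, 1, 50),
    (1, 2, 1, 2, 5),
    (2, 2, 3, 2, 30),
    (2, 3, 3, 2, 10),
    (3, 5, 1, 3, 4),
    (3, 5, 1, 4, 1),
]
CONDITIONS = ("a", "b", "c")


def default_rules(projection, threshold):
    """Rules (premise, decision, alpha, beta) whose membership alpha/beta exceeds threshold."""
    idx = [CONDITIONS.index(name) for name in projection]
    # Group objects by their values on the projected attributes, i.e. U/IND(C_Pr).
    by_class = defaultdict(lambda: defaultdict(int))
    for *values, count in CLASSES:
        key = tuple(values[i] for i in idx)
        by_class[key][values[3]] += count
    rules = []
    for key, decisions in by_class.items():
        beta = sum(decisions.values())              # size of the class E
        for decision, alpha in decisions.items():   # size of E intersected with X_j
            if alpha / beta > threshold:
                rules.append((dict(zip(projection, key)), decision, alpha, beta))
    return rules


for rule in default_rules(("a",), 0.55):
    print(rule)
```

With the projection onto attribute a alone and a threshold of 0.55, the sketch yields a rule for a = 1 with decision 1 supported by 50 of 55 objects (certainty factor about 0.91), alongside rules for a = 2 and a = 3.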

To make it clearer, let’s look at a simple training decision table. The information
system $A = (U, A)$, displayed in Table 1, resulted from having observed a total of
one hundred objects (the universe $U$) that were classified according to condition
attributes $C = \{a, b, c\}$. The decision attribute is $d$. The partition of the
universe induced by the condition attributes contains $n = 5$ classes, namely $E_1$
through $E_5$. The class $E_5$ is shown split into two disjoint sets of objects,
$E_5 = E_{5,1} \cup E_{5,2}$, reflecting the different decisions, $d = 3$ (for
$E_{5,1}$) and $d = 4$ (for $E_{5,2}$). Hence, the system is inconsistent with
respect to the objects in class $E_5$.

Table 1. Example Information System

Table 2. Discernibility Matrix of the Decision System

The discernibility matrix $M_D(C) = \{m_D(i, j)\}_{n \times n}$ (over the condition
attributes $C = \{a, b, c\}$) of the decision system is given in Table 2. Its core
attribute set is $\{a, c\}$. If the threshold value ($\mu_{tr}$) is set to be 0.55,
we can obtain the following rules:

$R_1$: $a_1 c_3 \to d_1 \mid 1.0$,   $R_2$: $a_1 c_1 \to d_2 \mid 1.0$,   $R_3$: $b_2 c_1 \to d_2 \mid 1.0$,

$R_4$: $a_2 \to d_2 \mid 1.0$,   $R_5$: $b_3 \to d_2 \mid 1.0$,   $R_6$: $a_3 \to d_3 \mid 0.8$,

$R_7$: $b_5 \to d_3 \mid 0.8$,

where a rule $a_i b_j c_k \to d_p \mid \mu$ means that if the values of the condition
attributes a, b and c are i, j and k respectively, then the decision of this object
should be p with a certainty factor $\mu$.

According to Algorithm 1, the following rules and facts can also be generated from
the projections ($C - \{a\}$, $C - \{c\}$, $C - \{a, b\}$, and $C - \{a, c\}$) of Table 1:

$R_8$ ($C - \{a\}$): $b_2 c_3 \to d_1 \mid 0.62$,   $R_9$ ($C - \{c\}$): $a_1 \to d_1 \mid 0.91$,

$R_{10}$ ($C - \{a, b\}$): $c_3 \to d_1 \mid 0.56$,   $R_{11}$ ($C - \{a, c\}$): $b_2 \to d_1 \mid 0.59$,



$F_1$ ($C - \{a\}$): … $\to \neg R_8$,   $F_2$ ($C - \{c\}$): $c_1 \to \neg R_9$,

$F_3$ ($C - \{a, b\}$): $b_3 \to \neg R_{10}$,   $F_4$ ($C - \{a, c\}$): $a_2 \to \neg R_{11}$, $c_1 \to \neg R_{11}$.

One can find that the premise parts of default rules are short. They allow reasoning
in the absence of knowledge. Assume now that a new object is observed, for which the
value of the attribute a is 1, whereas the values for all the other attributes of the
object are unknown. The definite rules, i.e., those rules whose certainty factor is 1,
do not sanction any conclusion in this case. We may, however, apply the default rule
$R_9$ to conclude (by assumption) that the decision in this case should be 1 with a
certainty factor 0.91. If, later, further knowledge is made available, the assumption
(and therefore the conclusion) may have to be retracted. Unfortunately, Skowron’s
algorithm is not complete. For instance, if the object to be classified is
$a_1 b_3 c_3$, then the decision should be 1 (CF = 1.0) according to rule $R_1$ while
it should be 2 (CF = 1.0) according to rule $R_5$. We still cannot get the decision.
Again, if the case to be classified is $a_1 b_5 c_2$, then the decision should be 1
(CF = 0.91) according to rule $R_9$ while it should be 3 (CF = 0.8) according to rule
$R_7$. How should the decision be derived in this case? Obviously, this problem is
caused by conflicts between rules. Skowron’s analysis of the inconsistency in a
decision table is not complete.



3 On the Inconsistency of an Information System 

Through examining a decision table, we find that there may be 3 kinds of inconsistent 
information in a decision table from which default rules need to be extracted. 

1. There are some inconsistent objects in the training decision table. This kind of
inconsistency may arise in the following 3 cases:

• The condition attributes are insufficient to describe objects. Some additional
attributes will be needed to distinguish objects.

• There may be errors in the measuring process of the values for attributes and
mistakes in the recording process.

• Some inconsistency may be generated in the preprocessing of the original data.
For example, in the discretizing process, some continuous values are converted into
discrete ones. Some objects that could be distinguished from each other according to
their original continuous attribute values may become indistinguishable.

2. Inconsistency generated through selecting projections over the condition
attributes.

3. There may exist some objects outside of the training decision table that are
inconsistent with objects in the training decision table. There is usually only a part
of the objects of the universe in the training decision table. So, we don’t know
whether there is inconsistency in the universe even if the training decision table is
consistent. Rules can generate inconsistent decisions for a yet unseen object even if
they are all consistent in the training decision table.

The first kind of inconsistency is explicit and can be discovered directly from the
training decision table. The 2nd kind of inconsistency is generated by the rule
generation algorithm and thus can be discovered in the rule generation process.
Unfortunately, the 3rd kind of inconsistency is implicit and therefore unpredictable in
the process of rule generation. We don’t know whether there is any 3rd kind of
inconsistency until it happens. Skowron considered only the first 2 kinds of
inconsistency in his algorithm. The conflict problems stated in section 2 were caused
by the 3rd kind of inconsistency. Rules generated from a limited training decision
table that is a subset of the universe can only be guaranteed to be consistent in the
training decision table, and may be inconsistent in the universe. For example, rules
$R_1$ and $R_5$ are consistent in Table 1. However, if there is an object
$a_1 b_3 c_3$ in the universe, these two rules will conflict. That is, they are
inconsistent in the universe.



4 Rule Generation under Inconsistency and its Corresponding 
Reasoning Method 

First, we modify Skowron’s algorithm. Rules are written in the following new style:

$$R: \; Des(E_{(k,C)}, C) \to Des(X_j, D) \mid (|E_{(k,C)} \cap X_j|, \; |E_{(k,C)}|)$$

That is, the certainty factor is not recorded directly in a rule. Two parameters, the
number of objects in the intersection of the equivalence class $E_{(k,C)}$ and the
decision class $X_j$, and the number of objects in the equivalence class $E_{(k,C)}$,
are recorded instead. Then, the following rules can be extracted from Table 1.

$R_1$: $a_1 c_3 \to d_1 \mid (50, 50)$,   $R_2$: $a_1 c_1 \to d_2 \mid (5, 5)$,   $R_3$: $b_2 c_1 \to d_2 \mid (5, 5)$,

$R_4$: $a_2 \to d_2 \mid (40, 40)$,   $R_5$: $b_3 \to d_2 \mid (10, 10)$,   $R_6$: $a_3 \to d_3 \mid (4, 5)$,

$R_7$: $b_5 \to d_3 \mid (4, 5)$,   $R_8$: $b_2 c_3 \to d_1 \mid (50, 80)$,   $R_9$: $a_1 \to d_1 \mid (50, 55)$,

$R_{10}$: $c_3 \to d_1 \mid (50, 90)$,   $R_{11}$: $b_2 \to d_1 \mid (50, 85)$.

The fact set remains unchanged.

One can find that each rule carries not only its certainty factor
($|E_{(k,C)} \cap X_j| / |E_{(k,C)}|$) but also the frequency ($|E_{(k,C)}|$) of the
objects described by the premise of the rule occurring in the training decision table.
$|E_{(k,C)}|$ is called the frequency of the rule. The 3rd kind of inconsistency can be
processed using this information.

Algorithm 2: Reasoning method under inconsistency

Input: A rule set $Q$, $Q = \{R_i \mid i = 1, \ldots, n\}$, that can be matched by an
object to be classified, where the parameters of rule $R_i$ are $(\alpha_i, \beta_i)$,
its conclusion is $\gamma_i$, and $\gamma_i \ne \gamma_j$ ($i \ne j$). If there is more
than one rule (e.g. $m$ rules) mapping the object to the same conclusion, then the rule
$R_l$ with $\alpha_l/\beta_l^2 = \mathrm{Max}\{\alpha_i/\beta_i^2 \mid i = 1, \ldots, m\}$
is selected to be the representative of these $m$ rules. If several rules (e.g. $k$
rules) among these $m$ rules have the maximum $\alpha_i/\beta_i^2$ at the same time,
then the rule $R_l$ with $\alpha_l/\beta_l = \mathrm{Max}\{\alpha_i/\beta_i \mid i = 1, \ldots, k\}$
should be selected.

Output: The decision for the object with its certainty factor.

Step 1: Select the decision ($\gamma$) for the object:

$\gamma = \gamma_l$, where $\alpha_l/\beta_l^2 = \mathrm{Max}\{\alpha_i/\beta_i^2 \mid i = 1, \ldots, n\}$.

If several decisions (e.g. $k$ rules) have the maximum $\alpha_i/\beta_i^2$, then the
decision $\gamma_l$ with $\alpha_l/\beta_l = \mathrm{Max}\{\alpha_i/\beta_i \mid i = 1, \ldots, k\}$
should be selected.

Step 2: Calculate the certainty factor (CF) for the decision:

$CF = \mathrm{Min}\{\alpha_i/\beta_i \mid i = 1, \ldots, n\}$.

That is, the certainty factor of the final decision is the minimum one of the
certainty factors of all matched rules. Logical AND operation is adopted here.

Our basic idea to select a suitable decision for a yet unseen object is that the classes 
containing fewer objects of a decision table may represent some special cases in the 
universe, thus, rules generated from these classes should have more priority in the 
reasoning process. This rule-choosing stratagem is called lower frequency first, that 
is, the rule with the lowest frequency has the greatest priority. Let’s take a look at the 
characteristics of this reasoning method. 

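Suppose there are two inconsistent rules that can match the object to be classified.

1. If $\alpha_1/\beta_1 = \alpha_2/\beta_2$, then $\gamma = \gamma_i$, $\beta_i = \mathrm{Min}\{\beta_i \mid i = 1, 2\}$;

2. If $\beta_1 = \beta_2$, then $\gamma = \gamma_i$, $\alpha_i = \mathrm{Max}\{\alpha_i \mid i = 1, 2\}$;

3. If $\beta_1 > \beta_2$, then

• If $\alpha_2 = \beta_2$, then $\gamma = \gamma_2$;

• If $\alpha_1 = \beta_1$, we might suppose $\beta_2 = \beta_1 - a$, $\alpha_2 = \beta_2 - b$, then

$$\gamma = \begin{cases}
\gamma_1, & a \le b \\
\gamma_1, & (a > b) \wedge (a^2/(a - b) > \beta_1) \\
\gamma_2, & (a > b) \wedge (a^2/(a - b) < \beta_1) \\
\gamma_i, & (a > b) \wedge (a^2/(a - b) = \beta_1) \wedge (\alpha_i/\beta_i = \mathrm{Max}\{\alpha_i/\beta_i \mid i = 1, 2\})
\end{cases}$$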
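
The last case analysis can be checked with a short derivation. It is a sketch that assumes, as in Algorithm 2, that the rule with the larger $\alpha_i/\beta_i^2$ is preferred, and it uses the substitutions $\alpha_1 = \beta_1$, $\beta_2 = \beta_1 - a$ and $\alpha_2 = \beta_2 - b$ with $a > 0$:

```latex
\begin{align*}
\frac{\alpha_1}{\beta_1^2} > \frac{\alpha_2}{\beta_2^2}
  &\iff \frac{1}{\beta_1} > \frac{\beta_1 - a - b}{(\beta_1 - a)^2} \\
  &\iff (\beta_1 - a)^2 > \beta_1 (\beta_1 - a - b) \\
  &\iff \beta_1^2 - 2a\beta_1 + a^2 > \beta_1^2 - a\beta_1 - b\beta_1 \\
  &\iff a^2 > (a - b)\,\beta_1 .
\end{align*}
```

If $a \le b$, the right-hand side is not positive while $a^2 > 0$, so the first rule is always preferred. If $a > b$, dividing by $a - b$ gives the condition $a^2/(a - b) > \beta_1$ for the first rule, the reverse inequality for the second rule, and a tie when equality holds, in which case the certainty factor $\alpha_i/\beta_i$ decides.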

Using this reasoning method and rule-choosing stratagem, a suitable decision can
be generated for any object according to default rules generated from a training
decision table. The two objects $a_1 b_3 c_3$ and $a_1 b_5 c_2$, which cannot be
classified by Skowron’s method, can now be classified into suitable classes using this
method. The object $a_1 b_3 c_3$ can be matched with rules $R_1$ and $R_5$; its final
decision is $d_2$ with a certainty factor 1.0. The object $a_1 b_5 c_2$ can be matched
with rules $R_9$ and $R_7$; its final decision is $d_3$ with a certainty factor 0.8.
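
A minimal sketch of this reasoning method is given below, assuming that rules are ranked by $\alpha/\beta^2$ with the certainty factor $\alpha/\beta$ as tie-break, and that the blocking facts are ignored. The rule parameters are copied from the list in this section; the names `RULES`, `priority` and `classify` are hypothetical.

```python
from fractions import Fraction

# Rules as (name, premise, decision, alpha, beta), taken from the list in this section.
RULES = [
    ("R1", {"a": 1, "c": 3}, 1, 50, 50),
    ("R2", {"a": 1, "c": 1}, 2, 5, 5),
    ("R3", {"b": 2, "c": 1}, 2, 5, 5),
    ("R4", {"a": 2}, 2, 40, 40),
    ("R5", {"b": 3}, 2, 10, 10),
    ("R6", {"a": 3}, 3, 4, 5),
    ("R7", {"b": 5}, 3, 4, 5),
    ("R8", {"b": 2, "c": 3}, 1, 50, 80),
    ("R9", {"a": 1}, 1, 50, 55),
    ("R10", {"c": 3}, 1, 50, 90),
    ("R11", {"b": 2}, 1, 50, 85),
]


def priority(rule):
    """Sort key: alpha/beta^2 first, certainty factor alpha/beta as tie-break."""
    alpha, beta = rule[3], rule[4]
    return (Fraction(alpha, beta * beta), Fraction(alpha, beta))


def classify(obj):
    """Return (decision, certainty factor) for obj, or None if no rule matches."""
    matched = [r for r in RULES if all(obj.get(k) == v for k, v in r[1].items())]
    if not matched:
        return None
    # Keep one representative rule per conclusion: the one with the highest priority.
    representatives = {}
    for rule in matched:
        best = representatives.get(rule[2])
        if best is None or priority(rule) > priority(best):
            representatives[rule[2]] = rule
    winner = max(representatives.values(), key=priority)
    # Step 2: the certainty factor is the minimum over the representative rules.
    cf = min(Fraction(r[3], r[4]) for r in representatives.values())
    return winner[2], float(cf)


print(classify({"a": 1, "b": 3, "c": 3}))
print(classify({"a": 1, "b": 5, "c": 2}))
print(classify({"a": 1}))  # attributes b and c unknown
```

Exact fractions are used so that ties between priorities are detected without floating-point noise. An object with unknown attribute values is modelled simply by omitting those keys, so only rules whose whole premise is known can match.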



5 Simulation Result 

To test the validity of the rule-choosing stratagem of this paper, we compare it with
the rule-choosing stratagem of higher frequency first. In the stratagem of higher
frequency first, we can reason in the following way.

Suppose there are two inconsistent rules ($R_1$ and $R_2$) that can match the object to
be classified.

If $\alpha_1/\beta_1 = \alpha_2/\beta_2$, then $\gamma = \gamma_i$, $\beta_i = \mathrm{Max}\{\beta_i \mid i = 1, 2\}$;

Else if $\beta_1 = \beta_2$, then $\gamma = \gamma_i$, $\alpha_i = \mathrm{Max}\{\alpha_i \mid i = 1, 2\}$;

Else if $\beta_1 > \beta_2$, then

If $\alpha_1/\beta_1 > \alpha_2/\beta_2$, then $\gamma = \gamma_1$;

Else $\gamma = \gamma_i$, $\alpha_i^2/\beta_i = \mathrm{Max}\{\alpha_i^2/\beta_i \mid i = 1, 2\}$;

Five data sets are used in our simulation experiments. Each data set is randomly
divided into 2 equal parts, one of which (the training set) is used for rule
generation. The modified Skowron’s default rule generation algorithm is used. The
experimental results are shown in Table 3.

Table 3. Experiments Result

In Table 3, S is the data set, |S| is the cardinality of set S, C(S) is the number of
conflicting samples in S, TS is the training set, |TS| is the cardinality of TS, C(TS) is
the number of conflicting samples in TS, VS is the set used to test the recognition
rate, C(R) is the number of test samples which can be matched with more than one
rule, RR is the correct recognition rate, and CCR is the rate of choosing a correct rule.
LFF is the stratagem of lower frequency first, while HFF is the stratagem of higher
frequency first. There are some conflicting rules for every sample to be recognized
when the default rules generated by Skowron’s algorithm are used. Thus, RR = CCR
for every test. In order to get decision tables with many conflicting samples, the Naive
algorithm and the Semi-Naive algorithm in Rosetta are used for discretization.

From the experimental results in Table 3, one can find that the stratagem of lower
frequency first is much better than the stratagem of higher frequency first when a
training set itself is used to test the generated rules. Moreover, the misrecognized
samples under the stratagem of lower frequency first are all conflicting samples in the
training set. Thus, the rules are a good representation of the information of the
training set if the stratagem of lower frequency first is used. In contrast, the rules
cannot represent the information of the training set if the stratagem of higher
frequency first is used. We can also find from Table 3 that the recognition rate of the
stratagem of lower frequency first is much higher than that of the stratagem of higher
frequency first when the whole data set is used in the recognition test. Moreover, the
recognition rates of 3 data sets are higher than the recognition rate of their training
set. This is unreasonable.

6 Conclusion

We studied the problem of extracting default rules under inconsistency in this paper.
Based on Skowron’s default rule generation algorithm, we thoroughly examined the
inconsistency that may occur in an information system. We developed a new style of
representing default rules, an algorithm for generating rules from an inconsistent
decision table, and a corresponding reasoning method with the rule-choosing
stratagem of lower frequency first under inconsistency. The default rules generated
have a strong ability to match new objects to be processed. A suitable decision can be
generated for any yet unseen object including one with unknown attribute values and
one that is even inconsistent with objects of the training decision table.

Abstract: In the paper nine different approaches to missing attribute values
are presented and compared. Ten input data files were used to investigate the
performance of the nine methods to deal with missing attribute values. For
testing both naive classification and new classification techniques of LERS
(Learning from Examples based on Rough Sets) were used. The quality
criterion was the average error rate achieved by ten-fold cross-validation.
Using the Wilcoxon matched-pairs signed rank test, we conclude that the
C4.5 approach and the method of ignoring examples with missing attribute
values are the best methods among all nine approaches; the most common
attribute-value method is the worst method among all nine approaches; while
some methods do not differ from other methods significantly. The method of
assigning to the missing attribute value all possible values of the attribute
and the method of assigning to the missing attribute value all possible values
of the attribute restricted to the same concept are excellent approaches based
on our limited experimental results. However we do not have enough evidence
to support the claim that these approaches are superior.



Key words: Data mining, knowledge discovery in databases, machine 
learning, learning from examples, missing attribute values. 



1 Introduction 

One of the main tools of data mining is rule induction from raw data represented by a 
database. Real-life data are frequently imperfect: erroneous, incomplete, uncertain and 
vague. In the reported research we investigated one of the forms of data 
incompleteness: missing attribute values. 

We assume that the input data files are in the form of a table, which is 
called a decision table. In this table, each column represents one attribute, which 
represents some feature of the examples, and each row represents an example by all its 
attribute values. The domain of each attribute may be either symbolic or numerical. 
We assume that all the attributes of input data are symbolic. Numerical attributes, 
after discretization, become symbolic as well. For each example, there is a decision 
value associated with it. The set of all examples with the same decision value is 
called a concept. Members of the concept are called positive examples, while all other 
examples are called negative examples. 

The table is inconsistent if there exist two examples with all attribute values 
identical, but belonging to different concepts. For inconsistent data tables, we can 
induce rules which are called certain and possible [5]. 
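
The definition of inconsistency above can be checked mechanically. The following is a minimal sketch, assuming a decision table stored as a list of (attribute values, decision) pairs; the function name and the toy data are illustrative and do not come from LERS.

```python
from collections import defaultdict

def find_inconsistencies(examples):
    """Return the attribute-value tuples that occur with more than one decision."""
    decisions_by_case = defaultdict(set)
    for attributes, decision in examples:
        decisions_by_case[tuple(attributes)].add(decision)
    return {case: ds for case, ds in decisions_by_case.items() if len(ds) > 1}

# Toy decision table (hypothetical): two attributes and one decision per example.
table = [
    (("high", "yes"), "flu"),
    (("high", "yes"), "cold"),      # same attribute values, different concept
    (("normal", "no"), "healthy"),
]
conflicts = find_inconsistencies(table)
print(conflicts)
```

A non-empty result means the table is inconsistent in the sense defined above.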

2 Description of Investigated Approaches to Missing 
Attribute Values 

We used the following nine approaches to missing attribute values: 

1. Most Common Attribute Value. It is one of the simplest methods to deal 
with missing attribute values. The CN2 algorithm [3] uses this idea. The value of 
the attribute that occurs most often is selected to be the value for all the unknown 
values of the attribute. 

2. Concept Most Common Attribute Value. The most common attribute 
value method does not pay any attention to the relationship between attributes and a 
decision. The concept most common attribute value method is a restriction of the 
first method to the concept, i.e., to all examples with the same value of the decision 
as an example with a missing attribute value [9]. This time the value of the attribute 
that occurs most often within the concept is selected to be the value for all 
the unknown values of the attribute. This method is also called the maximum relative 
frequency method, or the maximum conditional probability method (given concept). 

3. C4.5. This method is based on entropy and on splitting the example with missing 
attribute values to all concepts [12]. 

4. Method of Assigning All Possible Values of the Attribute. In this 
method, an example with a missing attribute value is replaced by a set of new 
examples, in which the missing attribute value is replaced by all possible values of 
the attribute [4]. If we have some examples with more than one unknown attribute 
value, we will do our substitution for one attribute first, and then do the substitution 
for the next attribute, etc., until all unknown attribute values are replaced by new 
known attribute values. 

5. Method of Assigning All Possible Values of the Attribute 
Restricted to the Given Concept. The method of assigning all possible values 
of the attribute does not take the concept into account. This method is a restriction of the 
method of assigning all possible values of the attribute to the concept indicated by an 
example with a missing attribute value. 

6. Method of Ignoring Examples with Unknown Attribute Values. 
This method is the simplest: just ignore the examples which have at least one 
unknown attribute value, and then use the rest of the table as input to the successive 
learning process. 

7. Event-Covering Method. This method, described in [2] and [14], is also a 
probabilistic approach to fill in the unknown attribute values. By event-covering we 
mean covering or selecting a subset of statistically interdependent events in the 
outcome space of variable-pairs, disregarding whether or not the variables are 
statistically independent [14]. 

8. A Special LEM2 Algorithm. A special version of LEM2 that works for 
unknown attribute values omits the examples with unknown attribute values when 
building the block for that attribute [6]. Then, a set of rules is induced by using the 
original LEM2 method. 

9. Method of Treating Missing Attribute Values as Special Values. 
In this method, we deal with the unknown attribute values using a totally different 
approach: rather than trying to find some known attribute value as its value, we treat 
“unknown” itself as a new value for the attributes that contain missing values and 
treat it in the same way as other values. 
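
Several of the simpler approaches in the list above can be sketched in a few lines. The code below is only an illustration under stated assumptions: missing values are marked with "?", a table is a list of (attribute tuple, decision) pairs, and all function names are hypothetical. C4.5, event covering and the special LEM2 variant are not reproduced. The sketch also assumes every attribute has at least one known value (within each concept for method 2).

```python
from collections import Counter
from itertools import product

MISSING = "?"  # assumed marker for a missing attribute value

def most_common(values):
    """Most frequent known value; ties go to the value seen first."""
    known = [v for v in values if v != MISSING]
    return Counter(known).most_common(1)[0][0]

def fill(attrs, replacement):
    """Replace the missing entries of one example by per-attribute values."""
    return tuple(replacement[j] if v == MISSING else v for j, v in enumerate(attrs))

def method1_most_common(rows):
    n = len(rows[0][0])
    replacement = [most_common([a[j] for a, _ in rows]) for j in range(n)]
    return [(fill(a, replacement), d) for a, d in rows]

def method2_concept_most_common(rows):
    n = len(rows[0][0])
    per_concept = {}
    for concept in {d for _, d in rows}:
        members = [a for a, d in rows if d == concept]
        per_concept[concept] = [most_common([a[j] for a in members]) for j in range(n)]
    return [(fill(a, per_concept[d]), d) for a, d in rows]

def method4_all_possible(rows):
    """Each example is expanded into one copy per combination of possible values."""
    n = len(rows[0][0])
    domains = [sorted({a[j] for a, _ in rows if a[j] != MISSING}) for j in range(n)]
    out = []
    for a, d in rows:
        choices = [domains[j] if v == MISSING else [v] for j, v in enumerate(a)]
        out.extend((combo, d) for combo in product(*choices))
    return out

def method6_ignore(rows):
    return [(a, d) for a, d in rows if MISSING not in a]

def method9_special_value(rows, special="unknown"):
    return [(tuple(special if v == MISSING else v for v in a), d) for a, d in rows]

# Hypothetical toy table: attributes (temperature, headache), decision = concept.
rows = [
    (("high", "yes"), "flu"),
    (("high", "no"), "flu"),
    (("high", "yes"), "flu"),
    (("?", "no"), "healthy"),
    (("normal", "no"), "healthy"),
]
print(method1_most_common(rows)[3])          # filled with the overall most common value
print(method2_concept_most_common(rows)[3])  # filled within the concept "healthy"
```

On this toy table methods 1 and 2 disagree on the fourth example, which is the effect of restricting the counts to the concept. Method 4 multiplies the number of examples by the domain size for every missing value, so the table can grow quickly.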

3 Classification 

Frequently, rules induced from raw data are used for classification of unseen, testing 
data. In the simplest form of classification, if more than one concept was indicated by 
rules for a given example, the classification of the example was counted as an error. 
Likewise, if an example was not completely classified by any of the rules, it was 
considered an error. This classification scheme is called the naive LERS classification 
scheme. 

The new classification system of LERS is a modification of the bucket brigade 
algorithm [1,7]. The decision to which concept an example belongs is made on the 
basis of three factors: strength, specificity, and support. They are defined as follows: 
Strength is the total number of examples correctly classified by the rule during 
training. Specificity is the total number of attribute-value pairs on the left-hand side 
of the rule. The matching rules with a larger number of attribute-value pairs are 
considered more specific. The third factor, support, is defined as the sum of scores of 
all matching rules from the concept. The concept C for which the support, i.e., the 
following expression 

$$\sum_{\text{matching rules } R \text{ describing } C} \mathrm{Strength}(R) \cdot \mathrm{Specificity}(R)$$

is the largest is a winner and the example is classified as being a member of C. 

If an example is not completely matched by any rule, some classification systems 
use partial matching. System AQ15, during partial matching, uses the probabilistic 
sum of all measures of fit for rules [10]. Another approach to partial matching is 
presented in [13]. Holland et al. [8] do not consider partial matching as a viable 
alternative to complete matching and rely on a default hierarchy instead. In the new 
classification system of LERS, if complete matching is impossible, all partially 
matching rules are identified. These are rules with at least one attribute-value pair 
matching the corresponding attribute-value pair of an example. 

For any partially matching rule R, an additional factor, called Matching_factor(R), 
is computed. Matching_factor is defined as the ratio of the number of matched 
attribute-value pairs of a rule with an example to the total number of attribute-value 
pairs of the rule. In partial matching, the concept C for which the following 
expression is the largest 

$$\sum_{\text{partially matching rules } R \text{ describing } C} \mathrm{Matching\_factor}(R) \cdot \mathrm{Strength}(R) \cdot \mathrm{Specificity}(R)$$

is the winner and the example is classified as being a member of C. 

Rules induced by a new version of LERS are preceded by three numbers: 
specificity, strength, and the total number of training examples matching the left-hand 
side of the rule. 
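
The two support expressions above can be turned into a small scoring routine. This is a sketch under assumptions, not the LERS implementation: a rule is represented as a dictionary with hypothetical keys "conditions", "concept" and "strength", and ties between concepts are broken arbitrarily.

```python
def matching_factor(rule, example):
    """Fraction of the rule's attribute-value pairs that the example matches."""
    conditions = rule["conditions"]
    matched = sum(1 for attr, value in conditions.items() if example.get(attr) == value)
    return matched / len(conditions)

def classify(rules, example):
    scored = [(rule, matching_factor(rule, example)) for rule in rules]
    complete = [(rule, mf) for rule, mf in scored if mf == 1.0]
    # Fall back to partial matching only when no rule matches completely.
    candidates = complete or [(rule, mf) for rule, mf in scored if mf > 0]
    support = {}
    for rule, mf in candidates:
        specificity = len(rule["conditions"])  # number of attribute-value pairs
        score = mf * rule["strength"] * specificity
        support[rule["concept"]] = support.get(rule["concept"], 0.0) + score
    return max(support, key=support.get) if support else None

# Hypothetical rule set; strengths are made-up training counts.
rules = [
    {"conditions": {"temp": "high", "headache": "yes"}, "concept": "flu", "strength": 4},
    {"conditions": {"temp": "high"}, "concept": "cold", "strength": 5},
    {"conditions": {"cough": "no", "temp": "normal"}, "concept": "healthy", "strength": 6},
]
print(classify(rules, {"temp": "high", "headache": "yes", "cough": "yes"}))
print(classify(rules, {"temp": "normal", "headache": "yes", "cough": "yes"}))
```

In the first call two rules match completely, and the concept with the larger sum of Strength times Specificity wins. In the second call no rule matches completely, so the Matching_factor weighting decides.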

4 Experiments 

Table 1 describes input data files, in terms of the number of examples, the number of 
concepts, and the number of attributes that describe the examples, that were used for 
our experiments. All ten data files were taken from the real world, where unknown 
attribute values frequently occur. 



Table 1. Description of data files 

Name of data file   No. of examples   No. of attributes   No. of concepts 
Breast cancer       286               9                   2 
Echocardiogram      74                13                  2 
Hdynet              1218              73                  2 
Hepatitis           155               19                  2 
House               435               16                  2 
Im85                201               25                  86 
New-o               213               30                  2 
Primary tumor       339               17                  21 
Soybean             307               35                  19 
Tokt                6608              67                  2 

The breast cancer data set was obtained from the University Medical Center, 
Institute of Oncology, Ljubljana, Yugoslavia, due to donations from M. Zwitter and 
M. Soklic. Breast cancer is one of three data sets provided by the Oncology Institute 
that have repeatedly appeared in the machine learning literature. Nine out of 
286 examples contain unknown attribute values. 

The echocardiogram data set was donated by Steven Salzberg and has been 
used several times to predict the survival of a patient. There are a total of 132 
missing values among all the attribute values. 

The hdynet data set, which comes from real life, describes premature births 
using 73 attributes. There were 814 out of 1218 examples containing 
unknown attribute values. 

The hepatitis data set was donated by G. Gong, Carnegie-Mellon University, via 
Bojan Cestnik of Jozef Stefan Institute. There were 75 out of 155 examples that 
contain unknown attribute values in this data set. 

Table 2. Error rates of input data sets by using LERS new classification 

                                        Methods 
Data file    1       2       3       4       5       6       7       8       9 
Breast       34.62   34.62   31.5    28.52   31.88   29.24   34.97   33.92   32.52 
Echo         6.76    6.76    5.4     —       —       6.56    6.76    6.76    6.76 
Hdynet       29.15   31.53   22.6    —       —       28.41   28.82   27.91   28.41 
Hepatitis    24.52   13.55   19.4    —       —       18.75   16.77   18.71   19.35 
House        5.06    5.29    4.6     —       —       4.74    4.83    5.75    6.44 
Im85         96.02   96.02   100     —       96.02   94.34   96.02   96.02   96.02 
New-o        5.16    4.23    6.5     —       —       4.9     4.69    4.23    3.76 
Primary      66.67   62.83   62.0    41.57   47.03   66.67   64.9    69.03   67.55 
Soybean      15.96   18.24   13.4    —       4.1     15.41   19.87   17.26   16.94 
Tokt         31.57   31.57   26.7    32.75   32.75   32.88   32.16   33.2    32.16 

Table 3. Error rates of input data sets by using LERS naive classification 

                                    Methods 
Data file    1       2       4       5       6       7       8       9 
Breast       49.30   52.1    46.98   47.32   48.38   52.8    52.1    47.55 
Echo         27.03   25.68   —       —       31.15   29.73   33.78   22.97 
Hdynet       67.49   69.62   —       —       65.27   69.21   56.98   61.33 
Hepatitis    38.06   28.39   —       —       32.5    37.42   41.29   34.84 
House        10.11   7.13    —       —       9.05    10.57   12.87   11.72 
Im85         97.01   97.01   —       97.01   94.34   97.01   97.01   97.01 
New-o        11.74   11.74   —       —       11.19   11.27   10.33   10.33 
Primary      83.19   77.29   53.16   60.09   81.82   80.53   82.1    79.94 
Soybean      25.41   22.48   —       4.86    24.06   24.10   21.82   22.15 
Tokt         63.62   63.62   62.82   62.82   64.15   63.36   63.62   63.89 

The house data set, which has 203 examples that contain unknown attribute 
values, consists of votes of 435 congressmen in 1984 on 16 key issues (yes or no). 

The im85 data set is from a 1985 Automobile Imports Database, and it consists 
of three types of entities: a) the specification of an auto in terms of various 
characteristics, b) its assigned insurance risk rating, and c) its normalized losses in use 
as compared to other cars. 

The new-o data set is another set of breast cancer data that uses different attributes 
from the breast cancer data set. In this data set, there are 30 attributes to describe the 
examples. There were a total of 213 examples, and 70 of them have at least one 
unknown attribute value. 

The primary-tumor data set was obtained from the University Medical Center, 
Institute of Oncology, Ljubljana, Yugoslavia. The primary-tumor data set has 21 
concepts and 17 attributes, and 207 out of 339 examples contain at least one missing 
value. 

The soybean data set was used by R. S. Michalski in the context of 
developing an expert system for soybean disease diagnosis. There are 19 classes, but 
only the first 15 classes have been used in prior work. The last four classes have 
very few examples. There are 41 examples that contain unknown attribute values. 

The tokt data set, which is the largest data file in this experiment, came from 
practical data about premature birth and is similar to the hdynet data set. Among 
6619 examples in this data set, only 11 examples contain unknown attribute values. 

In our experiments, we required that no decision value is unknown. If some 
unknown decision values existed in the input data files, the input data files were 
preprocessed to remove them. 

Our experiments were conducted as follows. All of the nine methods from 
Section 2 were applied to all the ten data sets. Both original data sets and our new 
data sets, except for C4.5 method, were sampled into ten pairs of training and testing 
data. Then the sampled files were used as input to LEM2 single local covering [5] to 
generate classification rules, except the special LEM2 method, where rules were 
induced directly from the data file with missing attribute values. Other data mining 
systems based on rough set theory are described in [11]. We used ten-fold cross 
validation for the simple and extended classification methods. The performance of 
different methods was compared by calculating the average error rate. Here, we made a 
slight modification, using leaving-one-out for the echocardiogram data set since it has 
fewer than 100 examples. 
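
The evaluation protocol described above can be sketched as follows. This is an assumption-laden illustration: `train` and `classify` are placeholders for a rule inducer and a classification scheme, and here a trivial majority-class learner stands in for LEM2, which is not reproduced. Leaving-one-out corresponds to setting `k` to the number of examples.

```python
import random
from collections import Counter

def cross_validation_error(examples, train, classify, k=10, seed=0):
    """Average error rate over k folds; each fold is used once for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # assumes len(examples) >= k
    fold_errors = []
    for i, test_fold in enumerate(folds):
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train(training)
        wrong = sum(1 for attrs, decision in test_fold if classify(model, attrs) != decision)
        fold_errors.append(wrong / len(test_fold))
    return sum(fold_errors) / k

# Stand-in learner (hypothetical): always predicts the majority concept.
def train_majority(rows):
    return Counter(d for _, d in rows).most_common(1)[0][0]

def classify_majority(model, attrs):
    return model

examples = [((i,), "a") for i in range(15)] + [((i,), "b") for i in range(15, 20)]
error = cross_validation_error(examples, train_majority, classify_majority)
print(error)
```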

In Tables 2 and 3, the error rates that were not available, because of the limited 
system memory, are indicated by "—". 

5 Conclusions 

Our main objective was a comparison of the methods to deal with missing attribute 
values. Results of our experiments are presented in Table 2 and Table 3. In order to 
rank those methods in a reasonable way we used the Wilcoxon matched-pairs signed 
rank test [7]. 

The very first observation is that the extended (LERS) classification is always 
better than the simple classification method. 

Results of the Wilcoxon matched-pairs signed rank test are: using LERS new 
classification method, C4.5 (method 3) is better than method 1 with a significance 
level 0.005. Also, method 6 is better than method 1, LEM2 (method 8) and method 9 
with significance level 0.1. Differences in performance for other combinations of 
methods are statistically insignificant. Similarly, for LERS naive classification, 
results of the Wilcoxon matched-pairs signed rank test are: method 2 is better than 
method 7 with significance level 0.1, method 9 is better than methods 1 and 7, in 
both cases with the significance level 0.05, and, finally, method 6 performs better 
than method 1 with significance level 0.05. Differences in performance for other 
combinations of methods are statistically insignificant. 
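
As an illustration of the test statistic only, the sketch below computes the signed rank sums for one pair of columns of Table 2 (method 1 versus method 3). Turning the sums into a significance level requires critical-value tables or a distribution approximation, and the authors' exact procedure (for instance, one-sided versus two-sided) is not stated here, so no significance level is derived. The function name is hypothetical.

```python
def signed_rank_sums(errors_a, errors_b):
    """Return (W+, W-): rank sums of positive and negative differences a - b."""
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Columns 1 and 3 of Table 2 (LERS new classification), in the row order of the table.
method1 = [34.62, 6.76, 29.15, 24.52, 5.06, 96.02, 5.16, 66.67, 15.96, 31.57]
method3 = [31.5, 5.4, 22.6, 19.4, 4.6, 100, 6.5, 62.0, 13.4, 26.7]
w_plus, w_minus = signed_rank_sums(method1, method3)
print(w_plus, w_minus)
```

A large W+ relative to W- means that method 1 tends to have the larger error rate on these data files.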

For methods that do not differ from each other significantly with respect to the 
Wilcoxon matched-pairs signed rank test, we estimated their relative performance by 
the number of test cases that have smaller error rate. If one method performs better 
than the other in more than 50% of the test cases, we — heuristically — conclude that it 
performs better than the other one. For example, in Table 2, since the C4.5 approach 
gives a smaller error rate than method 6 in 6 out of 10 test cases, we can conclude that 
using LERS new classification, the C4.5 approach performs better than method 6. 
Based on this heuristic evaluation principle, among all the indistinguishable methods 
except for method 4 and method 5, we observe that using LERS new classification, 
the C4.5 approach performs better than any other method; method 6 performs better 
than any other method except for the C4.5 approach; and method 1 performs worse 
than any other method. When using the LERS naive classification, method 9 
performs better than any other method; method 2 performs better than any other 
methods except for method 9; and method 1 performs worse than any other method. 
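
The heuristic comparison described above amounts to counting, over the data files where both methods produced a result, how often one method has the smaller error rate. A minimal sketch, using columns of Table 2 and representing unavailable entries as None (an assumed encoding):

```python
def wins(errors_a, errors_b):
    """Return (number of files where a has the smaller error, number of applicable files)."""
    pairs = [(a, b) for a, b in zip(errors_a, errors_b) if a is not None and b is not None]
    return sum(1 for a, b in pairs if a < b), len(pairs)

# Columns of Table 2 (LERS new classification), in the row order of the table.
method1 = [34.62, 6.76, 29.15, 24.52, 5.06, 96.02, 5.16, 66.67, 15.96, 31.57]
c45 = [31.5, 5.4, 22.6, 19.4, 4.6, 100, 6.5, 62.0, 13.4, 26.7]
method4 = [28.52, None, None, None, None, None, None, 41.57, None, 32.75]
method6 = [29.24, 6.56, 28.41, 18.75, 4.74, 94.34, 4.9, 66.67, 15.41, 32.88]

print(wins(c45, method6))      # C4.5 against method 6
print(wins(method4, method1))  # method 4 has results for three files only
```

The first call reproduces the count quoted in the text: C4.5 has the smaller error rate than method 6 in 6 of the 10 data files.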

We do not have enough experimental results for method 4 and method 5. But 
from our available results, they perform very well. These methods are promising 
candidates for the best-performance methods. However, it is risky for us to conclude 
that they are the best methods among all nine methods because we do not have 
enough test files to support this conjecture statistically, using the Wilcoxon matched- 
pairs signed rank tests. Using both new and naive classification of LERS, the error 
rate of method 4 is smaller than that of any other method in more than 50% of the 
applicable test cases; method 5 has a smaller error rate than any other methods, except 
method 4, in more than 50% of the applicable test cases. The approaches of method 4 
and method 5 are similar. By substituting a missing value by all possible values of an 
attribute, we retain as much information as possible, but the size 
of the resulting table may increase exponentially; thus we could not get results for 
some of our data sets because of insufficient system memory. 

Abstract. 

The study of similarity between DNA sequences is an important problem in 
bioinformatics. In the traditional approach, dynamic-programming-based pairwise 
alignment is used for measuring the similarity between two sequences. This 
method does not work well on a large data set. In this paper, we treat a motif like a 
phrase of a document and use text mining techniques for finding the frequent motifs, 
maximal frequent motifs, and motif-based association rules in a group of genes. 



1. Introduction 

In text mining, the phrases of a document play an important role in developing 
text mining algorithms [1]. We consider a motif, a subsequence that occurs relatively 
often in a set of DNA sequences, as a phrase of a document and develop algorithms 
for discovering motif-based association rules. The study of motifs has been 
considered in [2,5], but these techniques do not work well with the large data sets of genes 
that are common on the Internet. In this paper, we propose algorithms for finding 
motifs and motif-based association rules based on the idea of association rule mining, 
building on the association rule discovery algorithms in [3,7]. We also tested our proposed 
algorithms and present the experimental results on the data of 106 DNA promoter 
sequences from the UCI repository of machine learning databases. The paper is organized 
as follows: 1) Introduction; 2) The problem of frequent motif discovery; 3) The problem 
of discovering the relationship among motifs and classification rules; 4) Conclusions 
and future works. 



2. The Problem of Frequent Motif Discovery 



Let $A = \{\text{A}, \text{C}, \text{T}, \text{G}\}$ be the set of bases forming DNA sequences; each 
base is a nucleotide [6]. Each DNA sequence is considered as a text string $s_1 s_2 \ldots s_n$ 
where $s_k \in A$, $k = 1, \ldots, n$. We denote by $|s|$ the length of sequence $s$. Let $s$ be a DNA 
sequence and $P(s)$ be the set of subsequences (substrings) of $s$. We define on $P(s)$ an order 
relation "<" as: $\forall s_a, s_b \in P(s)$, $s_a < s_b \Leftrightarrow s_a$ is a subsequence of $s_b$. Given $s_m \in P(s)$, 
$s_m$ is called a maximal element of $P(s)$ if there is no $s_b \in P(s)$ other than $s_m$ such that 
$s_m < s_b$. 



2.1. Frequent Motifs and Maximal Frequent Motifs in a Set of DNA Sequences 

Let $S$ be a set of $m$ DNA sequences, $P(s_i)$ be the set of all subsequences of $s_i$, and 
$PU$ be the union of all $P(s_i)$ for $i = 1, \ldots, m$. Let $N(s_t)$ be the number of DNA sequences 
containing $s_t$. Given $s_t \in PU$ and a threshold $\tau \in [0,1]$, $s_t$ is called a frequent motif if 
$N(s_t)/m \geq \tau$. We denote $F(S,\tau) = \{ s_t \in PU \mid N(s_t)/m \geq \tau \}$. 

It is easy to see that if $s_a \in F(S,\tau)$ and $s_t \in P(s_a)$, then $s_t \in F(S,\tau)$. 

Let $S$ be a set of $m$ DNA sequences and $F(S,\tau)$ be the set of all the frequent motifs 
with threshold $\tau$. Given $a \in F(S,\tau)$, $a$ is called a maximal frequent motif of $S$ if and 
only if i) $a \in F(S,\tau)$ and ii) there is no $b \in F(S,\tau)$ with $a \neq b$ and $a < b$. 



2.2. A Proposed Algorithm for Finding the Frequent Motifs 

Given a set $S$ of $m$ DNA sequences and a threshold $\tau \in [0,1]$, find all the frequent 
motifs. 

Example 1: Given $S = \{s_1, s_2, s_3, s_4\}$ containing four DNA sequences as follows: 

s_1 = 'ACGTAAAAGTCACACGTAGCCCCACGTACAGT' 
s_2 = 'CGCGTCGAAGTCGACCGTAAAAGTCACACAGT' 
s_3 = 'GGTCGATGCACGTAAAATCAGTCGCACACAGT' 
s_4 = 'ACGTAAAAGTAGCTACCCGTACGTCACACAGT' 

With threshold $\tau = 1.0$, some frequent motifs are as follows: 

TCA, GTC, CACA, ACAGT, CGTAAAA. 

We propose the following algorithm for finding frequent motifs. Let $L(S,k,\tau)$ be the set 
of all frequent motifs with threshold $\tau$ whose length (number of bases) is $k$. 
The proposed algorithm is as follows: 

Answer = ∅ 
Generate L(S,1,τ) from {"A", "C", "T", "G"} 
For (k=2; L(S,k-1,τ) <> { }; k++) do begin 
    Generate L(S,k,τ) from L(S,k-1,τ) 
    Answer = ∪_k L(S,k,τ) 
end 
Return Answer 

a) Generate L(S,1,τ) 

The possible one-letter motifs are "A", "C", "T", "G", so we need to check each of 
them; if it satisfies the definition of a frequent motif, then we save it into $L(S,1,\tau)$. 

b) Generate L(S,k,τ) from L(S,k-1,τ) and L(S,1,τ) 

Since $s_a \in F(S,\tau)$ and $s_t \in P(s_a)$ imply $s_t \in F(S,\tau)$, we employ this 
proposition to generate $L(S,k,\tau)$ from $L(S,k-1,\tau)$ and $L(S,1,\tau)$. 




388 H. Kiem and D. Phuc 



The proposed algorithm is summarized as follows: 

Create a matrix whose rows and columns are L(S,1,τ). 
L(S,k,τ) = ∅ 
For (each s_y ∈ L(S,k-1,τ)) do 
    For (each s_x ∈ L(S,1,τ)) do 
    begin 
        s_t = s_y + s_x // string concatenation 
        If ((N(s_t)/m >= τ) and (|s_t| == k)) then SaveFreqMotif(s_t, L(S,k,τ)) 
    end; 
Answer = L(S,k,τ) 
Return Answer 

SaveFreqMotif(s_t, L(S,k,τ)) is the function for saving the frequent motif s_t into 
L(S,k,τ). 
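
The level-wise procedure can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' program: the function name is hypothetical, and the four sequences are those of Example 1 with the whitespace introduced by text extraction removed, which is an assumption about the original strings.

```python
def frequent_motifs(sequences, threshold, alphabet="ACTG"):
    """Return all substrings contained in at least threshold * m of the sequences."""
    m = len(sequences)

    def is_frequent(motif):
        return sum(1 for s in sequences if motif in s) / m >= threshold

    level = [base for base in alphabet if is_frequent(base)]  # L(S, 1, tau)
    bases = list(level)
    answer = []
    while level:
        answer.extend(level)
        # L(S, k, tau): extend every frequent (k-1)-motif by one frequent base.
        level = [y + x for y in level for x in bases if is_frequent(y + x)]
    return answer

S = [
    "ACGTAAAAGTCACACGTAGCCCCACGTACAGT",
    "CGCGTCGAAGTCGACCGTAAAAGTCACACAGT",
    "GGTCGATGCACGTAAAATCAGTCGCACACAGT",
    "ACGTAAAAGTAGCTACCCGTACGTCACACAGT",
]
motifs = frequent_motifs(S, 1.0)
print(len(motifs), "frequent motifs with threshold 1.0")
```

Extending only on the right is sufficient because every prefix of a frequent motif is itself frequent. The loop stops as soon as a level is empty.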

In the data set of 106 DNA promoter sequences, which is divided into two classes 
(promoter class and non-promoter class), with threshold $\tau = 0.3$ we discover 97 
frequent motifs as follows: 

A; C; T; G; AA; AC; AT; AG; CA; CC; CT; CG; TA; TC; TT; TG; GA; GC; GT; 
GG; AAA; AAC; AAT; AAG; ACA; ACC; ACT; ACG; ATA; ATC; ATT; ATG; 
AGA; AGC; AGT; AGG; CAA; CAC; CAT;CAG; CCA; CCT; CCG; CTA; CTC; 
CTT; CTG; CGA; CGC;CGT; CGG; TAA; TAC; TAT; TAG; TCA; TCC; TCT; 
TCG; TTA;TTC; TTT; TTG; TGA; TGC; TGT; TGG; GAA; GAC; GAT; GAG; 
GCA; GCC; GCT; GCG; GTA; GTC; GTT; GTG; GGA; GGC; GGT; AACT; 
AATG; ATGC; CAAT; CTTT; CTTG; TAAC; TACT; TACG; TTTT; TTGA; 
TTGT; GCAT; GCCT; GCTT and 58 maximal frequent motifs as follows: 

AAA; AAG; ACA; ACC; ATA; ATC; ATT; AGA; AGC; AGT; AGG; CAC; 
CAG; CCA; CCG; CTA; CTC; CTG; CGA; CGC; CGT; CGG; TAT; TAG; TCA; 
TCC; TCT; TCG; TTA; TTC; TGG; GAA; GAC; GAT; GAG; GCG; GTA; GTC; 
GTT; GTG; GGA; GGC; GGT; AACT; AATG; ATGC; CAAT; CTTT; CTTG; 
TAAC; TACT; TACG; TTTT; TTGA; TTGT; GCAT; GCCT; GCTT 
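
Given a list of frequent motifs, the maximal ones can be selected by the definition in Section 2.1: a motif is maximal if no other frequent motif contains it as a substring. A minimal sketch with a hypothetical helper name and a small made-up input:

```python
def maximal_motifs(frequent):
    """Keep the motifs that are not a substring of any other frequent motif."""
    return [a for a in frequent if not any(a != b and a in b for b in frequent)]

example = ["A", "C", "G", "AC", "CA", "ACA"]
print(maximal_motifs(example))
```

This is consistent with the lists above: for instance, TTT is frequent but not maximal because TTTT is also frequent.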



3. The Problem of Discovering the Relationship among Motifs 
and Classification Rules 

Given a finite set $O$ of $m$ objects and a finite set $D_T$ of $n$ descriptors, let $R$ be the 
binary relation from $O$ to $D_T$. Binary relation $R$ is represented by a matrix $B_T$, which is 
called the information matrix. Let $b_{ij}$ ($i = 1, \ldots, m$ and $j = 1, \ldots, n$) be the element of 
matrix $B_T$: $b_{ij} = 1$ if $(o_i, d_j) \in R$, i.e., object $o_i$ has descriptor $d_j$; otherwise $b_{ij} = 0$. Given $O$ 
and $D_T$, let $P(D_T)$ be the power set of $D_T$ and $P(O)$ be the power set of $O$. 

We define functions $\rho: P(D_T) \to P(O)$ and $\lambda: P(O) \to P(D_T)$ as 
follows: 

• Given $S \subseteq D_T$, then $\rho(S) = \{ o \in O \mid \forall d \in S, (o,d) \in R \}$ 

• Given $X \subseteq O$, then $\lambda(X) = \{ d \in D_T \mid \forall o \in X, (o,d) \in R \}$ 

3.1. Large Descriptor Sets and Maximal Large Descriptor Sets 

Given an information matrix $B_T$ of $O$, $D_T$ and a threshold MINSUP $\in [0,1]$, let 
$S \in P(D_T)$. $S$ is called a large descriptor set of $B_T$ if $S$ satisfies the condition: 

$\mathrm{Card}(\rho(S)) / \mathrm{Card}(O) \geq$ MINSUP 

where Card is the cardinality of a set. Let $L$ be the set of all large descriptor sets. Given 
$S_M \in L$, $S_M$ is called a maximal large descriptor set if and only if there is no $S_L \in L$ such that 
$S_M \subset S_L$. It is easy to see that if $S_A \in L$ and $S_B \subseteq S_A$, then $S_B \in L$. 
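
The functions ρ and λ and the large-descriptor-set condition translate directly into set comprehensions. The sketch below assumes the relation R is stored as a set of (object, descriptor) pairs; the function names and the toy data are illustrative only.

```python
def rho(S, R, objects):
    """Objects that have every descriptor in S."""
    return {o for o in objects if all((o, d) in R for d in S)}

def lam(X, R, descriptors):
    """Descriptors shared by every object in X."""
    return {d for d in descriptors if all((o, d) in R for o in X)}

def is_large(S, R, objects, minsup):
    """Large descriptor set test: Card(rho(S)) / Card(O) >= MINSUP."""
    return len(rho(S, R, objects)) / len(objects) >= minsup

# Hypothetical toy relation: four sequences described by three motifs.
objects = ["o1", "o2", "o3", "o4"]
descriptors = ["ATA", "ATT", "AGA"]
R = {("o1", "ATA"), ("o1", "ATT"), ("o2", "ATA"), ("o2", "ATT"),
     ("o3", "ATA"), ("o4", "AGA")}

print(rho({"ATA", "ATT"}, R, objects))
print(lam({"o1", "o3"}, R, descriptors))
print(is_large({"ATA", "ATT"}, R, objects, minsup=0.5))
```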



3.2. Association Rules and Confidence Factor 

Given $O$, $D_T$, $B_T$ and a threshold MINSUP, let $S$ be a large descriptor set of $B_T$ with 
$S = S_A \cup S_B$ and $S_A \cap S_B = \emptyset$. An association rule is a mapping from $S_A$ to $S_B$ and is 
denoted as $S_A \to S_B$. The confidence factor (CF) of this rule is calculated by: 

$\mathrm{Card}(\rho(S_A) \cap \rho(S_B)) / \mathrm{Card}(\rho(S_A))$. 

The CF expresses how likely the occurrence of $S_B$ is if $S_A$ is given. 
Normally, given thresholds MINCONF $\in [0,1]$ and MINSUP $\in [0,1]$, the task is to find the 
association rules which have support greater than MINSUP and CF greater than 
MINCONF. This problem has been solved by the algorithms in [3,7]. 
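
Support and the confidence factor can be computed directly from the definitions. The sketch below assumes each object is stored with its set of descriptors; the table contents are made up for illustration and are not the promoter data.

```python
def rho(S, table):
    """Objects whose descriptor set contains every descriptor in S."""
    return {o for o, descriptors in table.items() if S <= descriptors}

def support(S, table):
    return len(rho(S, table)) / len(table)

def confidence(S_A, S_B, table):
    """CF of the rule S_A -> S_B; assumes at least one object has all of S_A."""
    covered = rho(S_A, table)
    return len(covered & rho(S_B, table)) / len(covered)

# Hypothetical information table: sequence -> motifs it contains plus its class label.
table = {
    "s1": {"ATA", "ATT", "promoter+"},
    "s2": {"ATA", "ATT", "promoter+"},
    "s3": {"ATA", "ATT", "promoter-"},
    "s4": {"AGA", "promoter-"},
    "s5": {"ATA", "promoter+"},
}
print(support({"ATA", "ATT", "promoter+"}, table))
print(confidence({"ATA", "ATT"}, {"promoter+"}, table))
```

Here two of the three sequences containing both ATA and ATT are promoters, so the rule has confidence 2/3 and support 2/5.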



3.3. Discovering the Relationships among Motifs 

A large descriptor set is really a significant combination of frequent motifs. We 
employ the set of discovered frequent motifs $D$ as a part of the descriptor set $D_T$ and $O$ as 
the set of DNA sequences. From $O$ and $D_T = D \cup \{\text{promoter+}, \text{promoter-}\}$, we create an 
information matrix and employ the algorithms in [3,7] for discovering large descriptor 
sets, maximal large descriptor sets and the association rules. 

With the data set of 106 DNA promoters and MINSUP = 30%, we discover 97 large 
descriptor sets and 58 maximal large descriptor sets. Some maximal large descriptor 
sets are listed as follows: 

{AAG, ACA} support = 0.39; {ACC, CCA} support = 0.34; 

{AGA, ATC} support = 0.32; {TCT, TTA, TTC} support = 0.37; 

Some typical association rules discovered from 106 DNA promoter sequences are 
listed as follows: 

a) For promoter class 

{ATA, ATT} => Promoter+ confidence = 0.80; support = 0.30 

{ATA, TTA} => Promoter+ confidence = 0.78; support = 0.30 

b) For non-promoter class 

{AGA, CTC} => Promoter- confidence = 0.75; support = 0.31 

{AGA, GAC} => Promoter- confidence = 0.72; support = 0.31 

We plan to employ the above association rules as a means for classification 
[6]. The rule {ATA, ATT} => Promoter+ with confidence = 0.80 and support = 0.30 
means that if a DNA sequence contains motif ATA and motif ATT, then with 80% 
confidence this sequence belongs to the promoter class. 

3.4. Negation of the Information Matrix and Negative Association Rules 

Given a binary relation $R$ from $O$ to $D_T = \{d_1, \ldots, d_n\}$, let $D_N = \{\neg d_1, \ldots, \neg d_n\}$. We define 
a binary relation $R'$ from $O$ to $D_N$. Let $B_N$ be the information matrix of $R'$ and $B_T$ be 
the information matrix of $R$; then $B_N = \neg B_T$. This matrix is called the negation of the 
information matrix $B_T$. We employed the same methods with $B_N$ for discovering the 
large descriptor sets, maximal large descriptor sets and association rules. This kind of 
association rule is called a negative association rule [3]. 

With MINSUP = 0.3 and MINCONF = 0.7, we discover negative 
association rules from 106 DNA promoters; some of them are as follows: 

{¬AAA, ¬CTTG, ¬TTTT} => ¬Promoter+, confidence = 0.86, support = 0.30 
{¬CAAT, ¬TACG} => ¬Promoter-, confidence = 0.71, support = 0.32 

We employ the motif-based association rules and the negative association rules for 
the gene classification problem, as we did in text classification [4]. 
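
Negating the information matrix is an element-wise flip of its 0/1 entries, after which the same support and confidence computations apply. A minimal sketch on a made-up matrix (the column meanings and values are assumptions, not the promoter data):

```python
def negate(matrix):
    """Element-wise negation of a 0/1 information matrix."""
    return [[1 - b for b in row] for row in matrix]

def support(columns, matrix):
    """Fraction of rows that have a 1 in every listed column."""
    hits = sum(1 for row in matrix if all(row[j] == 1 for j in columns))
    return hits / len(matrix)

# Hypothetical B_T with columns: AAA, TTTT, promoter+.
B_T = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 1, 0],
]
B_N = negate(B_T)  # columns now read: not AAA, not TTTT, not promoter+

antecedent = support([0, 1], B_N)   # sequences with neither AAA nor TTTT
rule = support([0, 1, 2], B_N)      # ... that are also not promoters
print(antecedent, rule / antecedent)
```

The last line prints the support of the antecedent and the confidence of the negative rule on this toy matrix.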



4. Conclusions and Future Works 

From the viewpoint of association rule mining techniques, we have developed 
algorithms for finding the frequent motifs in a set of DNA sequences. The algorithms 
for discovering motifs, significant combinations of motifs, and association rules 
were proposed and tested on the data set of 106 DNA promoters 
from UCI. The experimental results encourage us to use frequent motifs as features 
of DNA sequences for gene identification problems. 

Abstract. An important data mining problem is to restrict the number of 
association rules to those that are novel, interesting, and useful. However, there are 
situations when a user is not allowed to access the database and can deal only 
with the rules provided by somebody else. The number of rules can be limited, 
e.g., for security reasons, or the rules can be of low quality. Still, the user hopes to 
find new interesting relationships. In this paper we propose how to induce as 
much knowledge as possible from the provided set of rules. Algorithms for 
inducing the theory as well as for computing maximal covering rules for the theory 
are provided. In addition, we show how to test the consistency of rules and how 
to extract a consistent subset of rules. 



1 Introduction 

The problem of discovery of strong association rules was introduced in [1] for sales 
transaction databases. The association rules identify sets of items that are purchased 
together with other sets of items. An important data mining problem is to restrict the 
number of association rules to those that are novel, interesting, and useful. However, there 
are situations when a user is not allowed to access the database and can deal only with 
the rules provided by somebody else. The number of rules can be limited, e.g., for 
security reasons, or the rules can be of low quality. Still, the user hopes to find new 
interesting relationships. The user may even be willing to induce as much knowledge 
as possible from the provided set of rules. We addressed this problem in [5], where we 
showed how to use the cover and extension operators in order to augment the 
original knowledge. The cover operator does not require any information on the statistical 
importance (support) of rules and produces rules at least as good as the original ones; the 
extension operator requires information on the support of the original rules. The newly 
induced rules can be of higher quality than the original ones [5]. Additionally, it was 
shown in [5] how to compute the least set of rules, called maximal covering rules, that 
represents all rules that can be induced by the two operators from the original rule set. 
It was shown that, in general, maximal covering rules do not constitute a subset of the 
original rule set; some induced rules can be maximal covering rules as well. 

In this paper we propose another approach to inducing maximal knowledge from
the given rule set. The new approach utilizes information both on supports and
confidences of original rules. The algorithms for inducing theory as well as for
computing maximal covering rules for the theory are provided. In addition, we show
how to test the consistency of the rule set. A simple method of extracting a consistent
subset of rules is shown.

2 Association Rules, Rule Cover, and Maximal Covering Rules

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of distinct literals, called items. In general, any set of
items is called an itemset. An itemset consisting of $k$ items will be called a $k$-itemset.
Let $D$ be a set of transactions, where each transaction $T$ is a subset of $I$. An
association rule is an expression of the form $X \Rightarrow Y$, where $\emptyset \neq X, Y \subset I$ and
$X \cap Y = \emptyset$. The support of an itemset $X$ is denoted by $sup(X)$ and defined as the number
(or the percentage) of transactions in $D$ that contain $X$. The support of the association rule
$X \Rightarrow Y$ is denoted by $sup(X \Rightarrow Y)$ and defined as $sup(X \cup Y)$. The confidence of $X \Rightarrow Y$ is
denoted by $conf(X \Rightarrow Y)$ and defined as $sup(X \cup Y) / sup(X)$. The problem of mining
association rules is to generate all rules that have sufficient support and confidence. In
the sequel, the set of all association rules whose support is greater than $s$ and
confidence is greater than $c$ will be denoted by $AR(s,c)$. If $s$ and $c$ are understood, then
$AR(s,c)$ will be denoted by $AR$.

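A notion of a cover operator was introduced in [3] for deriving a set of association
rules from a given association rule without accessing a database. The cover $C$ of the
rule $X \Rightarrow Y$, $Y \neq \emptyset$, was defined as follows:

$C(X \Rightarrow Y) = \{X \cup Z \Rightarrow V \mid Z, V \subseteq Y \wedge Z \cap V = \emptyset \text{ and } V \neq \emptyset\}$.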

Each rule in $C(X \Rightarrow Y)$ consists of a subset of the items occurring in the rule $X \Rightarrow Y$.
The antecedent of any rule $r$ covered by $X \Rightarrow Y$ contains $X$ and perhaps some items
from $Y$, whereas $r$'s consequent is a non-empty subset of the remaining items in $Y$.
The following properties of the cover operator will be used further in the paper:

Property 1 [3]. Let $r: (X \Rightarrow Y)$ and $r': (X' \Rightarrow Y')$ be association rules.

$r' \in C(r)$ iff $X' \cup Y' \subseteq X \cup Y$ and $X' \supseteq X$.

The next property states that every rule in the cover of another rule has support and
confidence at least as good as those of the covering rule.

Property 2 [3]. Let $r$ and $r'$ be association rules.

If $r' \in C(r)$, then $sup(r') \geq sup(r)$ and $conf(r') \geq conf(r)$.

It follows from Property 2 that if $r$ belongs to $AR(s,c)$, then every rule $r'$ in $C(r)$
also belongs to $AR(s,c)$. The number of different rules in the cover of the association
rule $X \Rightarrow Y$ is equal to $3^m - 2^m$, where $m = |Y|$ (see [3]).

Example 1. Let $T_1 = \{A,B,C,D,E\}$, $T_2 = \{A,B,C,D,E,F\}$, $T_3 = \{A,B,C,D,E,H,I\}$, $T_4 =
\{A,B,E\}$ and $T_5 = \{B,C,D,E,H,I\}$ be the only transactions in the database $D$. Let
$r$: $(B \Rightarrow DE)$. Then, $C(r) = \{B \Rightarrow DE, B \Rightarrow D, B \Rightarrow E, BD \Rightarrow E, BE \Rightarrow D\}$. The
support of $r$ is equal to 4 and its confidence is equal to 80%. The support and
confidence of all other rules in $C(r)$ are not less than the support and confidence of $r$. □
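
The cover operator and Example 1 can be checked with a short Python sketch. The transactions below are an assumption: they are the reading $T_1 = \{A,B,C,D,E\}$, $T_2 = \{A,B,C,D,E,F\}$, $T_3 = \{A,B,C,D,E,H,I\}$, $T_4 = \{A,B,E\}$, $T_5 = \{B,C,D,E,H,I\}$ that agrees with the stated support (4) and confidence (80%) of $B \Rightarrow DE$. The helper names (sup, conf, subsets, cover) are illustrative only and do not come from [3].

```python
from itertools import combinations

# Transactions of Example 1 (assumed reading, see the lead-in).
D = [
    set("ABCDE"),
    set("ABCDEF"),
    set("ABCDEHI"),
    set("ABE"),
    set("BCDEHI"),
]

def sup(itemset):
    """Number of transactions in D that contain every item of `itemset`."""
    return sum(1 for t in D if set(itemset) <= t)

def conf(x, y):
    """Confidence of the rule x => y, i.e. sup(x ∪ y) / sup(x)."""
    return sup(set(x) | set(y)) / sup(x)

def subsets(items):
    """All subsets of `items` (including the empty set), as frozensets."""
    items = sorted(items)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def cover(x, y):
    """C(x => y): rules (x ∪ z) => v with z, v disjoint subsets of y, v non-empty."""
    x, y = frozenset(x), frozenset(y)
    return {(x | z, v) for z in subsets(y) for v in subsets(y - z) if v}

rules = cover("B", "DE")
for a, c in sorted(rules, key=lambda r: (sorted(r[0]), sorted(r[1]))):
    print("".join(sorted(a)), "=>", "".join(sorted(c)),
          "sup =", sup(a | c), "conf =", conf(a, c))
```

Under these assumptions the cover contains $3^2 - 2^2 = 5$ rules, and none of them has lower support or confidence than $B \Rightarrow DE$, as Property 2 states.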

The knowledge one can induce from the rule set $R$ by means of the cover operator
is the union of the covers of all rules in $R$. The covers of different rules can overlap. It
was shown in [5] that the intersection of the covers of a set of rules $r_1, r_2, \ldots, r_n$ is equal to
$\emptyset$ or is the cover of the rule $X_1 \cup X_2 \cup \ldots \cup X_n \Rightarrow Y_1 \cap Y_2 \cap \ldots \cap Y_n$, where $X_i$ denotes the
antecedent and $Y_i$ denotes the consequent of the respective $r_i$. In particular, $C(r') \cap C(r) =
C(r')$ for $r' \in C(r)$, so $r'$ is less covering than $r$. The union of the covers of the rule set $R$
can be quite large, hence it is useful to keep only a subset of rules that are most
representative and such that the union of their covers is equal to the union of the covers of
all rules in $R$. This is the idea behind the notion of maximal covering rules for $R$
introduced in [5]. The maximal covering rules ($MCR$) for $R$ were defined as follows:

$MCR(R) = \{r \in R \mid \neg\exists r' \in R,\ r' \neq r \wedge r \in C(r')\}$.

Whatever can be induced from $R$ by the cover operator can also be induced from
its subset $MCR(R)$ (see [5] for the algorithm for computing $MCR(R)$).
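
As an illustration only, the definition of maximal covering rules can be evaluated directly by means of Property 1 (a rule $r'$ belongs to the cover of $r$ iff its itemset is contained in the itemset of $r$ and its antecedent contains the antecedent of $r$). The sketch below is a naive quadratic check, not the algorithm of [5]; the rule set and the helper names are hypothetical.

```python
def rule(x, y):
    """Represent the rule x => y as a pair of frozensets."""
    return (frozenset(x), frozenset(y))

def in_cover(r_small, r_big):
    """Property 1: r' ∈ C(r) iff X' ∪ Y' ⊆ X ∪ Y and X' ⊇ X."""
    (x1, y1), (x, y) = r_small, r_big
    return (x1 | y1) <= (x | y) and x1 >= x

def mcr(rules):
    """Rules of `rules` that lie in the cover of no other rule of `rules`."""
    return {r for r in rules
            if not any(r2 != r and in_cover(r, r2) for r2 in rules)}

# Hypothetical rule set: two rules are covered by B => DE, one is unrelated.
R = {rule("B", "DE"), rule("B", "D"), rule("BD", "E"), rule("A", "C")}
maximal = mcr(R)
for x, y in sorted(maximal, key=lambda r: sorted(r[0])):
    print("".join(sorted(x)), "=>", "".join(sorted(y)))
```

Here $B \Rightarrow D$ and $BD \Rightarrow E$ are dropped because both belong to $C(B \Rightarrow DE)$.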

3 Consistency of Rules 

3.1 Testing Consistency of Rules 

Having been provided with a set of rules, it is reasonable to check whether it is consistent,
i.e. whether it contains no mutually contradictory rules. Obviously, even if we know a
method of checking the consistency of the rule set, we cannot be sure that we were not
cheated by a rule provider who was smart enough to deform the rules by
adding or removing items or by changing the supports and confidences of rules in such a way that
the modified set of rules is still consistent. In the remainder of this section, we will
concentrate on checking the consistency of the delivered set of rules.

Let $R$ be a set of rules the supports and confidences of which are known. Then the
supports of the itemsets of these rules as well as the supports of the itemsets of the
antecedents of these rules are also known. The support of the antecedent of a rule $r \in R$
is equal to $sup(r) / conf(r)$. Applying this simple observation, we introduce the notion
of known itemsets for $R$, denoted by $KIS(R)$ and defined as follows:

$KIS(R) = \{X \cup Y \mid X \Rightarrow Y \in R\} \cup \{X \mid X \Rightarrow Y \in R\}$.

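Clearly, the support of $Z \in KIS(R)$ is determined in $R$ uniquely provided there is no pair
of different rules $X \Rightarrow Y$, $X' \Rightarrow Y'$ in $R$ such that:

• $Z = X \cup Y = X' \cup Y'$ and $sup(X \Rightarrow Y) \neq sup(X' \Rightarrow Y')$, or

• $Z = X \cup Y = X'$ and $sup(X \Rightarrow Y) \neq sup(X')$, or

• $Z = X = X'$ and $sup(X) \neq sup(X')$.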

We define the set of rules $R$ as inconsistent iff at least one of the conditions below is met:

C1. There is a rule in $R$, or an antecedent of a rule in $R$, the support of which is greater
than 100% or is not greater than 0;

C2. There is $X \in KIS(R)$ the support of which is not determined in $R$ uniquely;

C3. There are $X, Y \in KIS(R)$ such that their supports are determined in $R$ uniquely and
$sup(X) < sup(Y)$ for $X \subset Y$.

Example 2. Let us consider two rules discovered from some hospital database:
$r_1$: $\{X\} \Rightarrow \{U,M\}$ (sup = 10%, conf = 90%), $r_2$: $\{X,U\} \Rightarrow \{O\}$ (sup = 90%, conf = 10%),
where X stands for (medical treatment = X), U for (result = Unsuccessful), M for
(marital status = Married) and O for (age = Old). Hence, $KIS(R) = \{\{X,U,M\}, \{X\},
\{X,U,O\}, \{X,U\}\}$. The rules $r_1$ and $r_2$ determine the supports of the itemsets in $KIS(R)$
uniquely: $sup(\{X,U,M\}) = 10\%$, $sup(\{X\}) = 10\% / 90\% \approx 11\%$, $sup(\{X,U,O\}) = 90\%$,
$sup(\{X,U\}) = 90\% / 10\% = 900\%$. Nevertheless, the set of rules is inconsistent since
$sup(\{X,U\}) > 100\%$, $sup(\{X\}) < sup(\{X,U\})$, and $sup(\{X\}) < sup(\{X,U,O\})$. □
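
The consistency test can be sketched in Python as follows. This is a minimal illustration under the assumption that each rule is given as a tuple (antecedent, consequent, support, confidence) with supports expressed as fractions of 1; the function names are hypothetical. Exact fractions are used so that supports derived as sup / conf compare reliably.

```python
from fractions import Fraction as F

# The two rules of Example 2: (antecedent, consequent, support, confidence).
R = [
    (frozenset("X"), frozenset("UM"), F(10, 100), F(90, 100)),
    (frozenset("XU"), frozenset("O"), F(90, 100), F(10, 100)),
]

def known_supports(rules):
    """Map every itemset of KIS(R) to the set of support values R implies for it."""
    sups = {}
    for x, y, s, c in rules:
        sups.setdefault(x | y, set()).add(s)   # support of the rule's itemset
        sups.setdefault(x, set()).add(s / c)   # support of the antecedent
    return sups

def violations(rules):
    """List the violated conditions C1-C3 together with the offending itemsets."""
    sups = known_supports(rules)
    found = []
    for z, vals in sups.items():
        if any(v <= 0 or v > 1 for v in vals):
            found.append(("C1", z))            # support outside (0, 100%]
        if len(vals) > 1:
            found.append(("C2", z))            # support not determined uniquely
    unique = {z: next(iter(v)) for z, v in sups.items() if len(v) == 1}
    for x in unique:
        for y in unique:
            if x < y and unique[x] < unique[y]:
                found.append(("C3", x, y))     # a subset is rarer than its superset
    return found

for v in violations(R):
    print(v[0], ["".join(sorted(z)) for z in v[1:]])
```

For Example 2 the sketch reports one violation of C1 (the itemset {X,U}) and two violations of C3, which are the three inequalities listed in the example.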

3.2 Extracting Consistent Rule Set

The problem of extracting a consistent subset of rules from an inconsistent rule set
resembles the problem, known from non-monotonic logics, of computing a model for an inconsistent
theory. There may be several models that are more or less suited to reality.
Different models can overlap partially. The ExtractConsistentRules algorithm, which we
present in this section, finds a consistent subset of rules by eliminating the rules that
violate Conditions C1-C3.

Algorithm. ExtractConsistentRules(set of rules R);

forall rules r in R do
    r.antecedent.sup = r.sup / r.conf;
R' = {r ∈ R | 0 < r.sup ≤ 100% and 0 < r.antecedent.sup ≤ 100%};    // (C1)
F = KIS(R');
forall itemsets f in F do {
    f.supList = {sup(X⇒Y) | X⇒Y ∈ R' and X∪Y = f} ∪ {sup(X) | X⇒Y ∈ R' and X = f};
    if |f.supList| > 1 then {                                       // (C2)
        R' = R' \ {X⇒Y ∈ R' | X∪Y = f or X = f};
        remove f from F; }
    else f.visited = false; };
while F ≠ ∅ do {
    f = a maximal itemset in F;
    if f.visited = false then {
        V = all subsets of f in F with support lower than f.sup;
        if V ≠ ∅ then {                                             // (C3)
            R' = R' \ {X⇒Y ∈ R' | X∪Y = f or X = f};
            forall itemsets v in V do
                if v.visited = false then {
                    v.visited = true;
                    R' = R' \ {X⇒Y ∈ R' | X∪Y = v or X = v}; }; }; };
    remove f from F; };
return R';

At first, all rules whose support is incorrect are removed, as well as rules whose
antecedents have an incorrect support value (Condition C1). The remaining rules are
assigned to $R'$. Known itemsets $F$ are derived from $R'$. Next, for each itemset in $F$,
a list of support values is created based on the information on rules in $R'$. The itemsets
with non-unique support values are removed from $F$ together with the rules built
from these itemsets or having antecedents equal to these itemsets (Condition C2). The
remaining itemsets in $F$ are initially marked as not visited. For each itemset $f \in F$, if
not visited, it is checked whether there are subsets $V$ having lower support. If so, then
all rules corresponding to $f$ or its subsets in $V$ are removed (Condition C3). In the
algorithm, $f$ is chosen arbitrarily from among the currently maximal itemsets. This
ensures $f$ cannot invalidate other itemsets, so it is not kept in $F$ after its evaluation. On
the other hand, the itemsets in $V$ are not deleted since they may happen to invalidate
other supersets as well (unless they are currently maximal itemsets). Nevertheless, the
itemsets in $V$ no longer relate to any rule in $R'$ or its antecedent. In order to
avoid unnecessary evaluations, the itemsets in $V$ are marked as visited.

In the case of Example 2, the removal of the rule $r_2$, the antecedent of which has
support greater than 100%, is sufficient to obtain consistent knowledge.

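Example 3. Let us consider the following rule set $R = \{r_1, r_2, r_3, r_4, r_5\}$, where:

$r_1$: $\{A\} \Rightarrow \{B,C\}$ (sup = 35%, conf = 7/12), $r_2$: $\{B\} \Rightarrow \{C\}$ (sup = 30%, conf = 3/4),

$r_3$: $\{A,B\} \Rightarrow \{C,D\}$ (sup = 30%, conf = 3/4), $r_4$: $\{B,C\} \Rightarrow \{E\}$ (sup = 25%, conf = 5/6),

$r_5$: $\{A\} \Rightarrow \{E\}$ (sup = 40%, conf = 1/2).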



Now, we list the contents of $KIS(R)$ extended by the information on support(s) for
each itemset: $KIS(R)$ = {{A} (60%, 80%), {B} (40%), {A,B} (40%), {A,E} (40%),
{B,C} (30%), {A,B,C} (35%), {B,C,E} (25%), {A,B,C,D} (30%)}. All rules in $R$ and
their antecedents have acceptable values of support. However, there are two support
values associated with the itemset {A}: the value 60% was computed as the support of
the antecedent of the rule $r_1$, and the value 80% was computed as the support of the
antecedent of the rule $r_5$. Since we do not know which value is correct, we decide to
remove the itemset {A} together with the rules that were built from it (here none)
and the rules whose antecedent is equal to {A} (here $r_1$ and $r_5$). Let us also note that
the itemset {A,B,C} (35%) has greater support than its subset {B,C} (30%). Hence,
the rules built from {A,B,C} (here $r_1$) or {B,C} (here $r_2$) and the rules whose
antecedents are equal to one of the two itemsets (here $r_4$) will be removed (unless they
were removed earlier, as in the case of $r_1$). The final consistent set of rules is equal to
$\{r_3\}$. □
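
The extraction procedure can be traced on Example 3 with the following Python sketch. It is one possible reading of ExtractConsistentRules, not the authors' implementation: rules are assumed to be stored in a dictionary name -> (antecedent, consequent, support, confidence), the "maximal itemset" is chosen by processing itemsets in order of decreasing size, and all identifiers are hypothetical.

```python
from fractions import Fraction as F

def extract_consistent_rules(rules):
    """Return the names of the rules that survive Conditions C1-C3."""
    # (C1) keep rules whose support and antecedent support lie in (0, 1]
    kept = {n: r for n, r in rules.items()
            if 0 < r[2] <= 1 and 0 < r[2] / r[3] <= 1}

    def touching(f):
        """Names of kept rules built from itemset f or having antecedent f."""
        return {n for n, (x, y, s, c) in kept.items() if x | y == f or x == f}

    # support values implied by the kept rules for each known itemset
    sup_list = {}
    for x, y, s, c in kept.values():
        sup_list.setdefault(x | y, set()).add(s)
        sup_list.setdefault(x, set()).add(s / c)

    # (C2) drop itemsets with ambiguous support together with their rules
    sup = {}
    for f, vals in sup_list.items():
        if len(vals) > 1:
            for n in touching(f):
                del kept[n]
        else:
            sup[f] = next(iter(vals))

    # (C3) largest itemsets first, so each f is maximal among those remaining
    visited = set()
    for f in sorted(sup, key=len, reverse=True):
        if f in visited:
            continue
        lower = [v for v in sup if v < f and sup[v] < sup[f]]
        if lower:
            for n in touching(f):
                del kept[n]
            for v in lower:
                if v not in visited:
                    visited.add(v)
                    for n in touching(v):
                        del kept[n]
    return set(kept)

R = {
    "r1": (frozenset("A"), frozenset("BC"), F(35, 100), F(7, 12)),
    "r2": (frozenset("B"), frozenset("C"), F(30, 100), F(3, 4)),
    "r3": (frozenset("AB"), frozenset("CD"), F(30, 100), F(3, 4)),
    "r4": (frozenset("BC"), frozenset("E"), F(25, 100), F(5, 6)),
    "r5": (frozenset("A"), frozenset("E"), F(40, 100), F(1, 2)),
}
print(extract_consistent_rules(R))
```

On this input the rules with antecedent {A} are removed by C2 and the rules related to {B,C} by C3, in line with the discussion of Example 3.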



4 Inducing Theory 

In the sequel, we assume $R$ is a consistent set of rules whose supports and confidences
are known. In this section we propose how to induce as much knowledge as possible
from $R$ (i.e. the theory for $R$). In order to augment the initial knowledge $R$ we are going to
use the information on supports of itemsets which is available in $R$. We note that for
any itemsets $X, Y, Z$ such that $Y \subseteq X \subseteq Z$ the following holds: $sup(Y) \geq sup(X) \geq sup(Z)$.

Hence, the support of any (unknown) itemset $X$ can be estimated if there are $Y, Z$ in
$KIS(R)$ such that $Y \subseteq X \subseteq Z$. Applying the information on supports of itemsets in $KIS(R)$,
the support of $X$ can be assessed as follows:

$\min\{sup(Y) \mid Y \in KIS(R) \wedge Y \subseteq X\} \geq sup(X) \geq \max\{sup(Z) \mid Z \in KIS(R) \wedge X \subseteq Z\}$.

Now we introduce the notion of derivable itemsets for $R$. Derivable itemsets for $R$
will be denoted by $DIS(R)$ and defined as follows:

$DIS(R) = \{X \mid \exists Y, Z \in KIS(R),\ Y \subseteq X \subseteq Z\}$.

Obviously, $DIS(R) \supseteq KIS(R)$. Let the pessimistic support ($pSup$) and the optimistic support
($oSup$) of an itemset $X \in DIS(R)$ w.r.t. $R$ be defined as follows:

$pSup(X,R) = \max\{sup(Z) \mid Z \in KIS(R) \wedge X \subseteq Z\}$,

$oSup(X,R) = \min\{sup(Y) \mid Y \in KIS(R) \wedge Y \subseteq X\}$.

Then the real support of $X \in DIS(R)$ belongs to $[pSup(X,R), oSup(X,R)]$. Clearly, if
$X \in KIS(R)$, then $sup(X) = pSup(X,R) = oSup(X,R)$.

Property 3. Let $X, Y \in DIS(R)$ and $X \subset Y$. Then:

• $pSup(X,R) \geq pSup(Y,R)$,

• $oSup(X,R) \geq oSup(Y,R)$.

Knowing $DIS(R)$ one can induce (approximate) rules $X \Rightarrow Y$ provided
$X \cup Y \in DIS(R)$ and $X \in DIS(R)$. The pessimistic confidence ($pConf$) of induced rules
is defined as follows:

$pConf(X \Rightarrow Y, R) = pSup(X \cup Y, R) / oSup(X, R)$.
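
The bounds above are easy to compute once the known itemsets and their supports are stored in a dictionary. The sketch below assumes the known itemsets {A,B} (40%) and {A,B,C,D} (30%), i.e. those determined by the rule {A,B} ⇒ {C,D} (sup = 30%, conf = 3/4); the function names are illustrative.

```python
from fractions import Fraction as F

# Known itemsets with their supports (assumed example, see the lead-in).
known = {frozenset("AB"): F(40, 100), frozenset("ABCD"): F(30, 100)}

def p_sup(x, known):
    """Pessimistic support: largest known support among the supersets of x."""
    vals = [s for z, s in known.items() if x <= z]
    return max(vals) if vals else None

def o_sup(x, known):
    """Optimistic support: smallest known support among the subsets of x."""
    vals = [s for y, s in known.items() if y <= x]
    return min(vals) if vals else None

def derivable(x, known):
    """x is derivable iff it lies between two known itemsets."""
    return p_sup(x, known) is not None and o_sup(x, known) is not None

def p_conf(x, y, known):
    """Pessimistic confidence of x => y: pSup(x ∪ y) / oSup(x)."""
    return p_sup(x | y, known) / o_sup(x, known)

abc = frozenset("ABC")
print(p_sup(abc, known), o_sup(abc, known))            # bounds on sup({A,B,C})
print(p_conf(frozenset("AB"), frozenset("C"), known))  # lower bound on conf(AB => C)
```

In this example the support of {A,B,C} is only known to lie between 30% and 40%, and {A} alone is not derivable because no known itemset is contained in it.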

Property 4. Let $X, Y \in DIS(R)$ and $V \subset X$. Then:

$pConf(X \Rightarrow Y, R) \geq pConf(X \setminus V \Rightarrow Y \cup V, R)$.

Now we will introduce the notion of theory for the rule set $R$ formally. The theory for
$R$ will be denoted by $T(R)$ and defined as follows:

$T(R) = \{X \Rightarrow Y \mid X \cup Y \in DIS(R) \text{ and } X \in DIS(R)\}$.

It is guaranteed for every rule $r \in T(R)$ that its support is at least as good as
$pSup(r,R)$ and its confidence is at least as good as $pConf(r,R)$. Sometimes, we may be
interested only in deriving the rules from $R$ whose (pessimistic) support is greater than
$s$ and whose (pessimistic) confidence is greater than $c$. All rules derivable from $R$ that
satisfy these conditions will be denoted by $T(R,s,c)$. In particular, one may be
interested in discovery of $T(R,s,c)$, where $s = \min\{sup(r) \mid r \in R\}$ and $c = \min\{conf(r) \mid
r \in R\}$. In what follows, we offer the GenTheory algorithm, which computes $T(R,s,c)$.
In particular, for $s = 0$ and $c = 0$ the result will be equal to $T(R)$. The algorithm is a
modification of the AprioriGenRules algorithm [2].

Algorithm. GenTheory(set of rules R, min. sup. s, min. conf. c);

D = {Z ∈ DIS(R) | pSup(Z) > s};
forall k-itemsets Z ∈ D, k ≥ 2, do {
    H_1 = {{Y} | Y ∈ Z};                          // 1-item consequents
    for (i = 1; (H_i ≠ ∅) and (i < k); i++) do {
        forall itemsets Y ∈ H_i do {
            X = Z \ Y;
            if X ∈ D then {
                pConf = pSup(Z) / oSup(X);
                if pConf > c then
                    print the rule X ⇒ Y with confidence = pConf and support = pSup(Z);
                else
                    delete Y from H_i; }; };
        H_{i+1} = AprioriGen(H_i); }; };          // (i+1)-item consequents are generated

The GenTheory algorithm assigns to $D$ the itemsets from $DIS(R)$ whose pessimistic
support is greater than $s$. From each $k$-itemset $Z$ in $D$, $k \geq 2$, candidate
rules of the length $k$ are created. At first, the candidate rules with single-item
consequents and derivable antecedents (i.e. belonging to $D$) are considered. If the pessimistic
confidence of a candidate rule is greater than $c$, then the candidate rule belongs to
$T(R,s,c)$. Next, the AprioriGen function (see [2] for details) generates candidate rules
with 2-item consequents from the 1-item consequents of the discovered rules. (The
consequents $Y$ of candidate rules $X \Rightarrow Y$ that turned out not to have sufficient
confidence are not taken into account, since any rule created from $Z = X \cup Y$ with
a consequent containing $Y$, e.g. $X \setminus V \Rightarrow Y \cup V$, will not have sufficient pessimistic
confidence either.) In general, each $i$-th iteration looks as follows:

• Evaluate candidate rules with $i$-item consequents and derivable antecedents;

• If a candidate rule has sufficient confidence, print it out; otherwise remove its
consequent from the set of $i$-item consequents;

• Generate $(i+1)$-item consequents from the remaining $i$-item consequents.
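
For small rule sets the content of $T(R,s,c)$ can also be obtained by brute force, which is convenient for cross-checking a level-wise implementation. The sketch below is such a brute-force enumeration, not the GenTheory algorithm itself (it does not use AprioriGen or any pruning). The known itemsets are an assumed example, {A,B} (40%) and {A,B,C,D} (30%), and all names are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

# Assumed known itemsets KIS(R) with their supports.
known = {frozenset("AB"): F(40, 100), frozenset("ABCD"): F(30, 100)}

def p_sup(x):
    """Pessimistic support of a derivable itemset x."""
    return max(s for z, s in known.items() if x <= z)

def o_sup(x):
    """Optimistic support of a derivable itemset x."""
    return min(s for y, s in known.items() if y <= x)

def derivable_itemsets():
    """DIS(R): every X with Y ⊆ X ⊆ Z for some known itemsets Y and Z."""
    out = set()
    for y in known:
        for z in known:
            if y <= z:
                extra = sorted(z - y)
                for k in range(len(extra) + 1):
                    for add in combinations(extra, k):
                        out.add(y | frozenset(add))
    return out

def theory(s, c):
    """Rules X => Z\\X with Z, X derivable, pSup(Z) > s and pConf > c."""
    dis = derivable_itemsets()
    rules = {}
    for z in dis:
        if p_sup(z) <= s:
            continue
        for x in dis:
            if x < z:                          # proper subset: non-empty consequent
                pconf = p_sup(z) / o_sup(x)
                if pconf > c:
                    rules[(x, z - x)] = (p_sup(z), pconf)
    return rules

for (x, y), (ps, pc) in sorted(theory(0, 0).items(),
                               key=lambda kv: (sorted(kv[0][0]), sorted(kv[0][1]))):
    print("".join(sorted(x)), "=>", "".join(sorted(y)), "pSup =", ps, "pConf =", pc)
```

With $s = 0$ and $c = 0$ the sketch lists the whole theory for this small example; raising $c$ to 3/4 leaves no rule, since the thresholds are strict.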

5 Maximal Covering Rules for Theory 

In this section we consider the generation of maximal covering rules for the theory for $R$
(i.e. $MCR(T(R))$). Let us start with the property of rules generated from $DIS(R)$:
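
Property 6. Let $Z \in DIS(R)$, $Z' \in KIS(R)$, and $pSup(Z,R) = sup(Z')$.

• $pSup(X \Rightarrow Z \setminus X) = sup(X \Rightarrow Z' \setminus X)$,

• $pConf(X \Rightarrow Z \setminus X) = pConf(X \Rightarrow Z' \setminus X)$,

• If $X \Rightarrow Z \setminus X \in T(R,s,c)$, then $X \Rightarrow Z' \setminus X \in T(R,s,c)$,

• $X \Rightarrow Z \setminus X \in C(X \Rightarrow Z' \setminus X)$.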

Observations:

O1. It follows from the definition of $DIS(R)$ and Property 6 that for every rule in
$T(R,s,c)$ built from an itemset in $DIS(R) \setminus KIS(R)$ there is a covering rule in $T(R,s,c)$
built from an itemset in $KIS(R)$. This implies that no rule built from an itemset in
$DIS(R) \setminus KIS(R)$ is a maximal covering rule. Hence, the generation of candidate
maximal covering rules can be restricted to generating rules from $KIS(R)$.

O2. Property 6 implies that no rule $X \Rightarrow Z \setminus X$ built from $Z \in KIS(R)$ is maximal
covering if there is a proper superset $Z' \in KIS(R)$ of $Z$ having the same
(pessimistic) support as $Z$.

The two observations were used in the FastGenMaxCoveringRules algorithm that
computes $MCR(T(R))$. Our algorithm is a modification of the
FastGenAllRepresentatives algorithm we proposed in [4].

Algorithm. FastGenMaxCoveringRules(set of rules R, min. sup. s, min. conf. c);

K = {Z ∈ KIS(R) | pSup(Z) > s};
D = {Z ∈ DIS(R) | ∃Y ∈ K, Y ⊇ Z};
forall k-itemsets Z ∈ K, k ≥ 2, do {
    maxSup = max({sup(Z') | Z ⊂ Z' ∈ K} ∪ {0});
    if Z.sup ≠ maxSup then {
        A_1 = {{X} | X ∈ Z};                        // create 1-item antecedents
        for (i = 1; (A_i ≠ ∅) and (i < k); i++) do {   // loop
            forall itemsets X ∈ A_i ∩ D do {
                pConf = sup(Z) / oSup(X);
                /* Is X ⇒ Z\X an association rule? */
                if pConf > c then {
                    /* Is there no longer assoc. rule X ⇒ Z'\X that covers X ⇒ Z\X? */
                    if (maxSup / oSup(X) ≤ c) then
                        print the rule X ⇒ Z\X with support = sup(Z) and confidence = pConf;
                    /* Antecedents of association rules are not extended */
                    A_i = A_i \ {X}; }; };
            A_{i+1} = AprioriGen(A_i); }; }; };     // compute (i+1)-item antecedents

The FastGenMaxCoveringRules algorithm starts with computing the subsets $K$ and $D$ that consist
of the itemsets in $KIS(R)$ and $DIS(R)$, respectively, which have sufficient support.
The algorithm computes maximal covering rules from each $k$-itemset, $k \geq 2$, in $K$. Let $Z$
be a considered itemset in $K$. Only $k$-rules are generated from $Z$. First, maxSup is
determined as the maximum of the supports of the itemsets in $K$ that are proper
supersets of $Z$. If there is no proper superset of $Z$ in $K$, then maxSup = 0. If $sup(Z)$ is the
same as maxSup, then no maximal covering rule can be generated from $Z$ because
there is some proper superset of $Z$ with support equal to $sup(Z)$ (Observation O2).
Otherwise, 1-item antecedents of candidate rules are created. The loop starts. In
general, the $i$-th iteration of the loop looks as follows:

Each candidate $X \Rightarrow Z \setminus X$, where $X \subset Z$ belongs to the derivable $i$-itemsets in $A_i$, is
considered. A candidate rule with sufficient pessimistic confidence that does not
belong to the cover of a rule created from a proper superset of $Z$ is maximal covering.

When all maximal covering rules with $i$-item antecedents are found, the
$(i+1)$-item antecedents are created by the AprioriGen function from the $i$-item
antecedents of rules whose pessimistic confidence is not greater than the threshold $c$.
This can be justified as follows: Let $X \Rightarrow Z \setminus X$ be a rule with sufficient pessimistic
confidence. Then a rule $X' \Rightarrow Z \setminus X'$, where $X \subset X' \subset Z$, belongs to $C(X \Rightarrow Z \setminus X)$. So,
$X' \Rightarrow Z \setminus X'$ is not maximal covering. Hence, it does not make sense to create candidate
rules with antecedents being supersets of antecedents of rules with sufficient
pessimistic confidence since we know a priori that they are not maximal covering.

The algorithm ends if the set of antecedents of candidate rules is empty.
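
The control flow described above can be sketched in Python as follows. This is a simplified reading of the algorithm under stated assumptions: instead of calling AprioriGen, antecedents are enumerated by size and any antecedent that contains the antecedent of an already accepted rule is skipped, which has the same pruning effect; the longer covering rule is taken to exist when maxSup / oSup(X) exceeds $c$. The input data ({A,B} at 40%, {A,B,C,D} at 30%) and all identifiers are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

def max_covering_rules(known, s, c):
    """known: dict itemset -> support for KIS(R). Returns rule -> (sup, pConf)."""
    K = {z: v for z, v in known.items() if v > s}

    def o_sup(x):
        vals = [v for y, v in known.items() if y <= x]
        return min(vals) if vals else None      # None: x is not derivable

    result = {}
    for z, sup_z in K.items():
        if len(z) < 2:
            continue
        # largest support among the proper supersets of z in K (0 if none)
        max_sup = max([v for z2, v in K.items() if z < z2] + [0])
        if sup_z == max_sup:
            continue                            # Observation O2
        closed = []                             # antecedents that already gave a rule
        for i in range(1, len(z)):
            for x in map(frozenset, combinations(sorted(z), i)):
                if any(a <= x for a in closed):
                    continue                    # antecedents of rules are not extended
                o = o_sup(x)
                if o is None:
                    continue                    # antecedent must be derivable
                p_conf = sup_z / o
                if p_conf > c:
                    if max_sup / o <= c:        # no longer rule covers x => z\x
                        result[(x, z - x)] = (sup_z, p_conf)
                    closed.append(x)
    return result

known = {frozenset("AB"): F(40, 100), frozenset("ABCD"): F(30, 100)}
mc = max_covering_rules(known, 0, F(1, 2))
for (x, y), (sup, pc) in mc.items():
    print("".join(sorted(x)), "=>", "".join(sorted(y)), "sup =", sup, "pConf =", pc)
```

For this input the only maximal covering rule is AB ⇒ CD; every other rule derivable from the two known itemsets belongs to its cover.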

6 Related Work 

It was shown in [5] how to apply the cover operator $C$ and the so-called extension
operator $E$ in order to induce as much as possible from the original rule set $R$. The
extension operator for the rule set $R$ (denoted by $E(r,R)$) was defined as follows:

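$E(r: X \Rightarrow Y, R) = \{X \Rightarrow (X' \cup Y') \setminus X \mid \exists r': X' \Rightarrow Y' \in R,\ X \cup Y \subseteq X' \cup Y' \wedge sup(r) = sup(r')\}$.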

Let us stress that the cover operator does not need any information on the support of rules
in order to generate rules that are not weaker than the original rules in $R$. The
extension operator requires knowledge of the supports of rules, but does not apply the
information on the supports of the antecedents of rules. Therefore, we deduce that $T(R)$,
computed with the use of the knowledge of the supports of rules and their antecedents,
is a superset of the rules one would obtain by applying both the cover and
extension operators to $R$ (possibly many times, until no new knowledge can be
generated [5]).

7 Conclusions 

It was shown in the paper how to induce as much knowledge as possible from the
given rule set. The algorithms for inducing theory as well as for deriving maximal
covering rules for theory were offered. Unlike in [5], the whole theory for the rule set
can be computed at once. It was shown how to test the consistency of rules and how
to extract a consistent subset of rules.

Abstract. This paper presents the MicroNOMAD Discovering tool. The tool
combines topographic structures, textual and iconographic interaction, and
viewpoint exchanges by means of a major extension of the classical
Kohonen SOM model. It may be used for various discovery tasks in a
multimedia Digital Library context. The basic principles of the tool are
described first. Then, an experiment with the tool that has been carried out on the
multimedia database associated with the BIBAN "Art Nouveau" server is
presented.



1 Introduction 

Most Digital Library architectures are derived from Information Retrieval
models and are designed to help end-users in retrieving information «they already
know but they have lost a link to». Indeed, such architectures give the user no way of
exploring the knowledge of a corpus in order to answer questions such as «what is
the most important feature in this topic? what's new? etc.». The first aim of our
approach is therefore to build Information Discovering Systems rather than Information
Retrieval Systems. The second aim is to show that exploiting the existing interaction
between texts and images in a multimedia Digital Library context may well facilitate
the interpretation of the contents. To achieve this, the core model of our MicroNOMAD tool
derives strongly from the multimap topographic model, which has been successfully
tested on textual data in the framework of the NOMAD IR System [4]. This latter
model, which may be considered as an extension of the basic Kohonen topographic
map model, enables the user to browse through a documentary database by means of
an advanced topographic interface. To benefit from the discovering and browsing
properties of the NOMAD multimap model in a multimedia context, we have mainly
based our adaptation of the original model to the MicroNOMAD approach on a
parallel implementation of a thematic mapping and of an image mapping on the same
maps.

In the first part we explain the basics of our new Discovering tool. In the second
part we conclude with experiments on this tool.

2 The MicroNOMAD Adaptive Model

The basic MicroNOMAD image classification process is based on the Kohonen
topographic map model [3]. In this model a data classification is seen as a mapping onto
a 2D neuron grid in which the neurons have predefined neighborhood relations. After
the classification process, each neuron of the map plays the role of a data
class representative. The main advantages of the Kohonen map model, as compared to
other classification models, are its natural robustness and its very good illustrative
power. Indeed, it has been successfully applied to several classification tasks [5] [6]
[7]. In our own case, each topographic map is initially built by unsupervised
competitive learning carried out on the whole multimedia database. This learning
takes place through the profile vectors extracted from the image descriptions, which
describe the characteristics of these images in the viewpoint associated with the map.
For each neuron n of a map M, the basic competitive learning function has the
following classical form:

$W_n^{t+1} = W_n^{t} - a(t)\,k(t)\,(W_n^{t} - P^{t})$. (1)

The topological properties associated with the Kohonen maps make it then possible 
to project the original images (i.e. data) onto a map, so that their proximity on the map 
corresponds as closely as possible to their proximity in the viewpoint^ associated to 
said map. Once associated to a neuron, an image could be considered as a member of 
the class described by this neuron. 
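
A single step of the classical competitive learning rule can be sketched as follows. The sketch rests on assumptions that the text does not state: the learning rate (alpha) plays the role of a(t), a Gaussian function of the grid distance to the winning neuron plays the role of the neighborhood term k(t), and the winner is the neuron whose weight vector is closest to the profile vector in Euclidean distance. The function name som_step and the tiny two-neuron grid are hypothetical.

```python
import math

def som_step(weights, p, alpha, sigma):
    """One competitive-learning step on a 2D neuron grid.

    weights: dict (row, col) -> list of floats (one weight vector per neuron)
    p:       profile vector of the presented image
    Returns the grid position of the winning neuron.
    """
    def dist2(w):
        return sum((wi - pi) ** 2 for wi, pi in zip(w, p))

    # winner: neuron whose weight vector is closest to the profile vector
    winner = min(weights, key=lambda n: dist2(weights[n]))
    for n, w in weights.items():
        grid_d2 = (n[0] - winner[0]) ** 2 + (n[1] - winner[1]) ** 2
        k = math.exp(-grid_d2 / (2 * sigma ** 2))   # assumed Gaussian neighborhood
        # move the weight vector towards p: W <- W - alpha * k * (W - P)
        weights[n] = [wi - alpha * k * (wi - pi) for wi, pi in zip(w, p)]
    return winner

grid = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
best = som_step(grid, [0.9, 0.8], alpha=0.5, sigma=1.0)
print(best, grid)
```

In practice alpha and sigma decrease over the learning iterations, so that the map first organizes globally and then refines locally.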

After the preliminary learning phase, each map is organized through analysis of the
main components of the neuron profiles, so as to be legible for the user. The first phase
of this analysis consists in defining class names that can optimally represent the class
contents when the map is displayed to the user. The second phase of the analysis
consists in dividing the map into coherent logical areas or neuron groups [4] [5]. Each
area, which can be regarded as a macro-class of synthesis, yields very reliable
information on the relative importance of the different themes described by the map.

The communication between Kohonen maps, which was first introduced in the
NOMAD IR model [4], represents a major improvement of the basic Kohonen model.
In MicroNOMAD, this communication is based on the use of the images that have
been projected onto the maps as intermediary neurons or activity transmitters
between maps. The communication process between maps can be divided into two
successive steps: original activity setting on source maps (1) and activity transmission
to target maps (2).

The original activity may be directly set up by the user on the neurons or on the
logical areas of a source map. This protocol can be interpreted as the user’s choice to
highlight (positively or negatively) different themes representing his centers of
interest relative to the viewpoint associated with the source map. The original activity
may also be indirectly set up by the projection of a user’s query onto the neurons of a
source map. The effect of this process will then be to highlight the themes that are
more or less related to that query.

The activity transmission can be considered as a process of evaluation of the 
semantic correlations existing between themes of a source viewpoint (source map) 



^ The "viewpoint" notion is an original notion that was first introduced in the NOMAD
IR system to play the role of a semantic context of retrieval [4].
and themes belonging to several other viewpoints (target maps). It can also be
interpreted as a general form of unsupervised inference. The activity of a class i of a
target map T derived from the activity of a source map S is computed by the formula:

AT=f„^,(g(A„)),A„=g(AS). (2) 

Two modes (corresponding to the above-mentioned f function) are used for the
computation of the transmitted activity. In the "possibilistic" mode, each class of the
target maps inherits the activity transmitted by its most activated associated data.
This approach can help the user to detect weak semantic correlations (weak signals)
existing between themes belonging to different viewpoints. In the "probabilistic" mode,
each class inherits the average activity transmitted by its associated data, whether
activated or not. As opposed to the possibilistic computation, the probabilistic
computation yields a more reliable measure of the semantic correlations and may
then be used to differentiate between strong and weak matching.
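
The difference between the two modes can be illustrated with a small Python sketch. It only models the aggregation step described above (maximum versus average of the activities of the data associated with a target class); how the activity of each data item is obtained from the source map is not modelled, activities are assumed to be non-negative, and the function and variable names are hypothetical.

```python
def transmit(class_members, data_activity, mode):
    """Activity of each target class from the activities of its associated data.

    class_members: target class -> list of data ids projected onto that class
    data_activity: data id -> activity received from the source map
                   (data missing from the mapping count as not activated, 0.0)
    mode:          "possibilistic" (maximum) or "probabilistic" (average)
    """
    out = {}
    for cls, members in class_members.items():
        acts = [data_activity.get(d, 0.0) for d in members]
        if not acts:
            out[cls] = 0.0
        elif mode == "possibilistic":
            out[cls] = max(acts)               # most activated associated data
        elif mode == "probabilistic":
            out[cls] = sum(acts) / len(acts)   # average over all associated data
        else:
            raise ValueError("unknown mode: " + mode)
    return out

members = {"theme_1": ["img1", "img2", "img3", "img4"], "theme_2": ["img5"]}
activity = {"img1": 1.0}
print(transmit(members, activity, "possibilistic"))
print(transmit(members, activity, "probabilistic"))
```

A single activated image is enough to fully activate its class in the possibilistic mode, whereas in the probabilistic mode the same class receives only a quarter of that activity, which is why the first mode reveals weak signals and the second separates strong from weak matching.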



3 Experimentation 

A first experiment was carried out with the MicroNOMAD Discovering tool on the
multimedia database "Art Nouveau" managed by the BIBAN server [2]. This database
contains approximately 300 images related to the various artistic works of the Art
Nouveau School. It covers several domains, such as architecture, painting and
sculpture. The images have an associated bibliographic description optionally containing
title, indexer keywords and author information. We decided to use 3 different
viewpoints (profiles) in our experiment:

1. The "Indexer keywords" viewpoint. It is represented by the set of keywords used by
the indexer in the keyword description field of the images.

2. The "Title keywords" viewpoint. Its associated set of keywords is automatically built
through a basic keyword extraction (use of a stop word list and plural to
singular conversion) from image titles. After the keyword extraction, a new "Title
keywords" field is added to the image description.

3. The "Authors" viewpoint. It is represented by the set of authors cited in the image
descriptions.

The first step of the experiment consists in transforming the image descriptions
associated with the chosen viewpoints into profile vectors. For that step, we have also
chosen to apply a classical Log-Normalization step [9], in order to reduce the
influence of the most widespread words of the profiles. The second step is the building
of the original classifications. It has been implemented through the classical Kohonen
SOMPACK algorithm [7]. The results, which consist of three different classifications
associated with the three different viewpoints, are then "dressed" and converted to
XML format. For the sake of portability, the core of the MicroNOMAD Discovering
tool has been developed as a Java application. Its inputs are the XML classification
files produced in the preceding step, and it implements the class naming strategies, the
division of maps into logical areas, and the intermap communication
process described above.

The original multiple viewpoints classification approach directly produced very
interesting results, proving again the relevance of such an approach, which aims at
reducing the noise that is inevitably generated in an overall classification approach
while increasing the flexibility and granularity of the analyses. As an example, in
our experiment we found that a "Title keywords" classification can highlight
information that is very complementary to that highlighted by an "Indexer
keywords" classification.




Fig. 1. Partial view of the "Title keywords" map 



Maps also represent a useful tool for indexation specialists. They help them in
estimating the quality of the indexation of a database. Thanks to the classification
method, strong indexation incoherences can easily be found on the map: such
incoherences are obvious if themes that specialists judge to be of equal weight in a domain
appear with strongly different surface areas on a map.

After experimentation with several users, the opportunity to have
images and coherently organized textual information simultaneously on the same support (map)
seems to be definitely of great utility. The interpretation of classification results is
made much easier by the presence of images, while the text is a good help in the
choice of reliable browsing points in the multimedia database.

According to the users’ opinions, the intermap communication process appears to be a
very interesting and original feature of the model. It provides the system with a new
capability that may be called a dynamic and flexible browsing behavior. As opposed
to classical browsing mechanisms, such as hypertext links, the browsing effect can then
be directly tied to the user’s information and explanation needs. Moreover, the number
and the type of viewpoints (i.e. concurrent or complementary) that can be
used simultaneously are not limited by the model.

These last properties could lead us to consider our approach as a good basis for 
building intelligent multimedia discovering systems, especially for the ones that are 
strongly tied to image interpretation. Let us mention that the model is now tested for 
two important applications: 

1. Interactive browsing through museum database and intelligent setting up of 
exhibitions in the framework of the technical collection of the French scientific 
museum "Musee de la Villette". 

2. Management of multiple classifications of butterflies (color, shape, ...) in the 
Taiwanese NSC Digital museum of butterflies [2]. 

4 Conclusion

The development of the MicroNOMAD discovering tool clearly represents an important
step towards providing a Digital Library Server with an iconographic interface offering a high
level of interactivity. The first reactions we received when
demonstrating it in the BIBAN server context were very encouraging. Nevertheless,
we still have a lot of work to do if we want to put such an interface on the Internet or
to produce a tool enabling anyone to build this kind of application. The basic
browsing and querying capabilities of our tool seem to be well suited to overall
browsing and querying tasks, whatever the user’s abilities may be. Nevertheless, the
relative difficulty for non-specialists of precisely analyzing the classification
results that are produced by the tool (and working on them) is the real challenge. As
shown in this paper, sophisticated tools yield better hypotheses, but these are more
difficult to validate. A further source of confusion is that some domain specialists want
to get effective results by a deeper exploitation of both the expressive and the
discovering power of the MicroNOMAD tool without being familiar with neural theories
and their background behavior. The MicroNOMAD multimap core model will be very
useful to them in proposing new assumptions, but it will have to be interfaced with
very simple tools, thus enabling non-specialists in classification to check the proposed
assumptions. To achieve this, we plan to interface our model with such a
validation tool based on Galois lattices and logical inference [8].
Abstract. Data with plural sources are handled in an extended relational
model under a modal logic approach. A source of information corresponds to
a possible world, and relationships between sources are expressed by an
accessibility relation. An attribute value in a database relation is the
triplet that consists of a set of possible worlds, an accessibility relation
and a set of value assignment functions. A tuple in a relation consists of a
tuple value and a membership attribute value, which is the degree to which
the tuple value belongs to the relation. The degree, not always equal to 1
or 0, comes from different values being obtained from plural sources.
Furthermore, the degree is expressed by two approximate values; namely, one
is a degree in the lower limit that means necessity; the other is a degree
in the upper limit that means possibility. This comes from sources having
some relationships with others. Simple queries containing elementary
formulas and logical operators are shown in the extended relational model.

Keywords: Plural sources. Data modeling. Extended relational databases. 
Modal logic. 



1 Introduction 

Imperfection pervades the real world. Environments without imperfection are 
exceptional for the real world [9]. We realize some aspects of the real world in 
database systems by using information obtained from the real world. Thus, we 
cannot obtain practically useful databases in various fields without considering 
imperfect information. 

Thus far, many investigations have been carried out to make databases truly
realistic. Following Lipski's pioneering work [5], several extended versions of
relational models have been proposed to deal with a kind of imperfect
information [2, 9]. These extended relational models are constructed under the
premise that all

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 404-411, 2001. 
data for an entity are obtained from only one source of information. As a
matter of fact, we usually encounter plural sources for an entity in
environments accompanied with estimates. For example, when more than one
analyst publishes a report estimating the growth rate of a company, investors
have plural sources of the estimated growth rate for the company. In addition,
an analyst may have some relationships with others. Our objective is to show a
fundamental framework for handling data with plural sources of information in
databases.

The present paper is organized as follows: in section 2, data with plural
sources of information are expressed on the basis of a modal logic approach;
section 3 describes an extended relational model and simple select operators;
the last section gives concluding remarks.



2 Modal Logic and Sources of Information 

2.1 Modal Logic 

Modal logic is an extension of classical propositional logic. Its language
consists of the set of atomic propositions or propositional variables, logical
connectives $\neg$, $\wedge$, $\vee$, modal operators of necessity $\Box$ and
possibility $\Diamond$, and supporting symbols (, ), {, }, .... An atomic
proposition is a formula. The meaning of a formula is its truth value in a
given context. Various contexts are usually expressed in terms of modal logic.
A model $M$ of modal logic is the triple
$\langle \mathcal{W}, \mathcal{R}, \mathcal{V} \rangle$, where $\mathcal{W}$,
$\mathcal{R}$, and $\mathcal{V}$ denote a set of possible worlds $w_i$, a
binary relation on $\mathcal{W}$, and a set of value assignment functions
$\nu_i$, respectively. There is one value assignment function for each world in
$\mathcal{W}$, by which truth (t) or falsity (f) is assigned to each atomic
proposition. Value assignment functions are inductively extended to all
formulas in the usual way. The only interesting cases are:

\[ \nu_i(\Box p) = t \ \text{ iff for all } w_j \in \mathcal{W},\ (w_i, w_j) \in \mathcal{R} \text{ implies } \nu_j(p) = t, \]

and

\[ \nu_i(\Diamond p) = t \ \text{ iff there is some } w_j \in \mathcal{W} \text{ such that } (w_i, w_j) \in \mathcal{R} \text{ and } \nu_j(p) = t. \]

A relation $\mathcal{R}$ is usually called an accessibility relation; we say
that a world $w_j$ is accessible from a world $w_i$ when
$(w_i, w_j) \in \mathcal{R}$. It is convenient to represent the relation
$\mathcal{R}$ by an $n \times n$ matrix $\mathcal{R} = [r_{ij}]$ when $n$ is
the number of possible worlds, where

\[ r_{ij} = \begin{cases} 1 & \text{if } (w_i, w_j) \in \mathcal{R}, \\ 0 & \text{otherwise,} \end{cases} \]

and to define for each world $w_i \in \mathcal{W}$ and each formula $p$

\[ T(\nu_i(p)) = \begin{cases} 1 & \text{if } \nu_i(p) = t, \\ 0 & \text{otherwise.} \end{cases} \]
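
The two modal cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the accessibility relation is encoded as a list of 0/1 rows, the truth of a formula is a list of booleans (one per world), and the function names `box`, `diamond` and `T` as well as the two-world model are hypothetical, chosen for this sketch.

```python
from typing import List


def box(R: List[List[int]], truth: List[bool]) -> List[bool]:
    """Necessity: p holds in every world w_j accessible from w_i."""
    n = len(R)
    return [all(truth[j] for j in range(n) if R[i][j]) for i in range(n)]


def diamond(R: List[List[int]], truth: List[bool]) -> List[bool]:
    """Possibility: p holds in at least one world w_j accessible from w_i."""
    n = len(R)
    return [any(truth[j] for j in range(n) if R[i][j]) for i in range(n)]


def T(value: bool) -> int:
    """Indicator used for counting: 1 if the formula is true, 0 otherwise."""
    return 1 if value else 0


# Hypothetical two-world model: w1 accesses both worlds, w2 only itself.
R = [[1, 1],
     [0, 1]]
p = [True, False]  # nu_1(p) = t, nu_2(p) = f

print(box(R, p))                     # [False, False]
print(diamond(R, p))                 # [True, False]
print(sum(map(T, diamond(R, p))))    # 1
```

Counting the indicator values over all worlds, as in the last line, is the operation on which the compatibility degrees of Section 3 rely.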
2.2 Plural Sources of Information

When an entity has plural sources of information, those sources do not always
have the same contents. From data obtained from a source we only have the
possibility that the data are correct. Thus, each source of information for an
entity corresponds to a possible world, and in that possible world it is true
that the entity is characterized by the information. When an entity has plural
sources of information with the same contents, this is equivalent to the entity
having only one source of information.

A possible world is characterized by the fact that a proposition is true. To
model this, we use the following proposition:

$e_{\{x\}}$: "A value is a given element $x$",

where $x$ is an element in a given universe of discourse $\mathcal{T}$. For
each world $w_i \in \mathcal{W}$, it is assumed that $\nu_i(e_{\{x\}}) = t$ for
one and only one $x \in \mathcal{T}$. Thus, in multiple worlds of some modal
logic, a proposition $e_{\{x\}}$ for some $x$ is valued differently in
different worlds. In other words, the proposition $e_{\{x\}}$ is true for a
world $w_i$, but false for another world $w_j$.

Example 1

Let an entity have $u$ and $v$ as an attribute value for possible worlds $w_1$
and $w_2$, respectively. Then, propositions $e_{\{u\}}$ and $e_{\{v\}}$ are
true for the possible worlds $w_1$ and $w_2$, respectively. This means that the
value assignment functions are $\nu_1(e_{\{u\}}) = t$ and
$\nu_2(e_{\{v\}}) = t$ for the possible worlds $w_1$ and $w_2$, respectively.

Sources of information usually have some relationships with others. This is
expressed by an accessibility relation, which is a binary relation. For the
sake of simplicity, let two sources of information exist for an attribute of an
entity. This is expressed by a set of possible worlds
$\mathcal{W} = \{w_1, w_2\}$. When the two sources have no relationship with
each other, the accessibility relation $\mathcal{R}$ is:

\[ \mathcal{R} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \]

The possible worlds are not accessible to each other; namely, the two sources
are isolated.

When the two sources have relationships with each other,

\[ \mathcal{R} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}. \]

This is an equivalence relation. Possible worlds are accessible to each other.
This means that each possible world accepts that a proposition holds in the
other possible world.

When the first possible world accepts that a proposition holds in the second
possible world, but the second possible world does not accept what holds in the
first,

\[ \mathcal{R} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}. \]

The accessibility relation is asymmetric. This often appears in our daily life;
for example, A accepts B, but B does not accept A.



Example 2

An entity has five sources of information for a value of an attribute $A_1$,
and the obtained values of the attribute are a, b, c, c, and d, respectively.
The first source is isolated. The second and the third sources are accessible
to each other. The fourth source is accessible to the fifth, but the fifth
source is not accessible to the fourth. Then, an accessibility relation
$\mathcal{R}$ is

\[ \mathcal{R} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}. \]

A set $\mathcal{V}$ of value assignment functions consists of five value
assignment functions; namely, $\nu_1(e_{\{a\}}) = t$, $\nu_2(e_{\{b\}}) = t$,
$\nu_3(e_{\{c\}}) = t$, $\nu_4(e_{\{c\}}) = t$, $\nu_5(e_{\{d\}}) = t$.
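
The description in Example 2 can be turned into a matrix mechanically. The following is a minimal sketch, assuming that every world accesses itself (as the diagonal entries of the matrices in the text suggest) and that worlds are numbered from 1; the helper names `accessibility_matrix` and `is_symmetric` are hypothetical.

```python
def accessibility_matrix(n, pairs):
    """Build an n x n 0/1 matrix from (i, j) pairs meaning
    'world j is accessible from world i' (worlds numbered from 1).
    Assumption: every world is accessible from itself."""
    R = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i, j in pairs:
        R[i - 1][j - 1] = 1
    return R


def is_symmetric(R):
    """True when accessibility is mutual for every pair of worlds."""
    n = len(R)
    return all(R[i][j] == R[j][i] for i in range(n) for j in range(n))


# Example 2: source 1 is isolated, sources 2 and 3 accept each other,
# source 4 accepts source 5 but not the other way round.
R = accessibility_matrix(5, [(2, 3), (3, 2), (4, 5)])
for row in R:
    print(row)
print(is_symmetric(R))  # False, because of the pair (4, 5)
```

Keeping the relation as an explicit list of pairs makes it easy to add or retract a relationship between sources without touching the stored values.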



3 Extended Relational Model 

We develop an extended relational model where plural sources of information
exist. First, we address the framework of the extended relational model, and
then describe select operators, which are the most frequently used in query
processing. The other operators of relational algebra will be addressed in
another paper.

3.1 Framework

Definition

An extended relational scheme $R$ consists of a set of conventional attributes
$A = \{A_1, A_2, \ldots, A_m\}$ and a membership attribute $\mu$; namely,

\[ R = A \cup \{\mu\} = \{A_1, A_2, \ldots, A_m, \mu\}, \]

where $m$ is the number of conventional attributes.

Definition

The value $t[A_i]$ of an attribute $A_i$ in a tuple $t$ is represented by:

\[ \langle \mathcal{W}_{t[A_i]}, \mathcal{R}_{t[A_i]}, \mathcal{V}_{t[A_i]} \rangle, \]

where $\mathcal{W}_{t[A_i]}$, $\mathcal{R}_{t[A_i]}$, and
$\mathcal{V}_{t[A_i]}$ for $t[A_i]$ denote, respectively, a set of possible
worlds, an accessibility relation, and a set of value assignment functions.
In addition to the above extension, we introduce a membership attribute, as is
done in data models handling a kind of imperfect information [1, 3, 4, 6].

Definition

Each tuple value $t[A]$ in a database relation $r$ has its membership value
$t[\mu]$, which is expressed by $(t[\mu_\Box], t[\mu_\Diamond])$, where
$t[\mu_\Box]$ (resp. $t[\mu_\Diamond]$) is a degree in necessity (resp. in
possibility) that the tuple value $t[A]$ belongs to $r$; namely,

\[ r = \{ (t[A], (t[\mu_\Box], t[\mu_\Diamond])) \mid t[\mu_\Diamond] > 0 \}. \]

As addressed in the definition, a relation is an extended set having tuple
values as elements.

The degree to which a tuple value belongs to a relation comes from the extent
to which the tuple value is compatible with the restrictions imposed on the
relation. At the level of tuple values, restrictions imposed on a relation
specify a set of values that each tuple can take. When an attribute value in a
tuple contains data obtained from plural sources of information, the number of
possible worlds related to the tuple is plural, and generally different
possible worlds are associated with different values. A tuple value associated
with one possible world may be contained in the set of tuple values specified
by the restrictions, while another value associated with a different possible
world is not. Thus, the value of a membership attribute is not always equal to
0 or 1. Moreover, some possible worlds are associated with others by an
accessibility relation. As a result, the value of a membership attribute is not
obtained as an exact single value, but as a pair of approximate values; namely,
the lower limit and the upper limit, which mean necessity and possibility,
respectively.

The membership values are calculated, not given, in the present framework.
Suppose that $t[A]$ is expressed by
$\langle \mathcal{W}_{t[A]}, \mathcal{R}_{t[A]}, \mathcal{V}_{t[A]} \rangle$
and $|\mathcal{W}_{t[A]}| = n$. Compatibility degrees of the tuple value $t[A]$
with restrictions $C$ in necessity and in possibility are:

\[ Nec(C|t[A]) = T(\Box(C|t[A]))/n, \qquad Pos(C|t[A]) = T(\Diamond(C|t[A]))/n, \]

where $T(\Box(C|t[A]))$ (resp. $T(\Diamond(C|t[A]))$) is the number of worlds
in which the tuple value $t[A]$ is compatible with the restrictions $C$ in
necessity (resp. in possibility):

\[ T(\Box(C|t[A])) = \sum_i T(\nu_i(\Box(C|t[A]))), \qquad T(\Diamond(C|t[A])) = \sum_i T(\nu_i(\Diamond(C|t[A]))), \]

where $T(\nu_i(\Box(C|t[A])))$ (resp. $T(\nu_i(\Diamond(C|t[A])))$) is equal to
one when the tuple value $t[A]$ in a possible world $w_i$ is compatible with
the restrictions $C$ in necessity (resp. in possibility). For each possible
world $w_i \in \mathcal{W}_{t[A]}$,

\[ \nu_i(\Box(C|t[A])) = t \ \equiv\ \forall w_j \in \mathcal{W}_{t[A]}\ \big( (w_i, w_j) \in \mathcal{R}_{t[A]} \rightarrow \nu_j(C|t[A]) = t \big). \]

This formula means that for each possible world $w_i \in \mathcal{W}_{t[A]}$,
$\nu_i(\Box(C|t[A])) = t$ if and only if $\nu_j(C|t[A]) = t$ in all possible
worlds $w_j$ that are accessible from the possible world $w_i$. Similarly,

\[ \nu_i(\Diamond(C|t[A])) = t \ \equiv\ \exists w_j \in \mathcal{W}_{t[A]}\ \big( (w_i, w_j) \in \mathcal{R}_{t[A]} \wedge \nu_j(C|t[A]) = t \big). \]

This formula means that for each possible world $w_i \in \mathcal{W}_{t[A]}$,
$\nu_i(\Diamond(C|t[A])) = t$ if and only if $\nu_j(C|t[A]) = t$ in at least
one possible world accessible from the possible world $w_i$. Thus,
$Nec(C|t[A])$ (resp. $Pos(C|t[A])$) is the number ratio of possible worlds that
are compatible with the restrictions $C$ in necessity (resp. in possibility).
Example 3 

$\langle \mathcal{W}, \mathcal{R}, \mathcal{V} \rangle_{A_1}$ for an attribute
value $A_1$ of an entity is already obtained as shown in Example 2. We suppose
that the only restriction $C$ imposed on the relation to which the entity
belongs is "$A_1$ is $\{b, c\}$". We obtain a membership attribute value
(2/5, 3/5) from $Nec(C|t[A]) = 2/5$ and $Pos(C|t[A]) = 3/5$ by using the above
formulas.
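
The figures of Example 3 can be checked with a short computation. The sketch below assumes the accessibility matrix of Example 2 (with every world accessing itself) and reads the restriction as "the value is b or c"; the function name `nec_pos` and the list-based encoding are illustrative choices, not part of the model.

```python
from fractions import Fraction


def nec_pos(R, compatible):
    """Return (Nec, Pos) as the ratio of worlds that are compatible
    in necessity / in possibility.
    R: 0/1 accessibility matrix; compatible[j]: nu_j(C) = t in world w_j."""
    n = len(R)
    nec = sum(all(compatible[j] for j in range(n) if R[i][j]) for i in range(n))
    pos = sum(any(compatible[j] for j in range(n) if R[i][j]) for i in range(n))
    return Fraction(nec, n), Fraction(pos, n)


# Example 2: five sources with values a, b, c, c, d.
R = [[1, 0, 0, 0, 0],
     [0, 1, 1, 0, 0],
     [0, 1, 1, 0, 0],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 0, 1]]
values = ["a", "b", "c", "c", "d"]

C = {"b", "c"}                          # restriction of Example 3
compatible = [v in C for v in values]   # worlds 2, 3 and 4 satisfy C

nec, pos = nec_pos(R, compatible)
print(nec, pos)  # 2/5 3/5
```

Worlds 2 and 3 are compatible in necessity because every world they access holds b or c; world 4 is compatible only in possibility because it also accesses world 5, whose value is d.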

Tuples with low values of membership attributes may appear in database
relations. We use a pair of values $(t[\mu_{\Box,r}], t[\mu_{\Diamond,r}])$ as
a membership attribute value for a tuple $t$ in a relation $r$. These values
are degrees in necessity and in possibility that the tuple value $t[A]$ belongs
to the relation. From the degrees we obtain degrees
$(t[\bar{\mu}_{\Box,r}], t[\bar{\mu}_{\Diamond,r}])$ that $t[A]$ does not
belong to $r$; namely,

\[ t[\bar{\mu}_{\Box,r}] = 1 - t[\mu_{\Diamond,r}], \qquad t[\bar{\mu}_{\Diamond,r}] = 1 - t[\mu_{\Box,r}]. \]

We set the following criterion for accepting a tuple in a relation:

The degree in necessity (resp. in possibility) to which a tuple value $t[A]$
belongs to a relation $r$ is greater than or equal to the degree to which the
tuple value does not belong to it.

Namely, $t[\mu_{\Box,r}] \geq t[\bar{\mu}_{\Box,r}]$, which is equivalent to
$t[\mu_{\Diamond,r}] \geq t[\bar{\mu}_{\Diamond,r}]$, for an accepted tuple $t$
in a relation $r$. This means that
$t[\mu_{\Box,r}] + t[\mu_{\Diamond,r}] \geq 1$. A tuple that does not satisfy
this criterion should be discarded by users.
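
The acceptance criterion can be sketched as follows. The code assumes that the degrees of not belonging are obtained by the usual necessity/possibility duality (necessity of not belonging is one minus possibility of belonging, and vice versa), which is consistent with the negation rule used for select conditions in Section 3.2; the function names are hypothetical.

```python
from fractions import Fraction


def complement(mu_nec, mu_pos):
    """Degrees (necessity, possibility) that the tuple value does NOT belong.
    Assumption: duality between necessity and possibility."""
    return 1 - mu_pos, 1 - mu_nec


def accepted(mu_nec, mu_pos):
    """Accept when belonging is at least as strong as not belonging."""
    not_nec, not_pos = complement(mu_nec, mu_pos)
    return mu_nec >= not_nec and mu_pos >= not_pos


print(accepted(Fraction(2, 5), Fraction(3, 5)))  # True: 2/5 + 3/5 = 1
print(accepted(Fraction(1, 5), Fraction(3, 5)))  # False: 1/5 + 3/5 < 1
```

Under this assumption both comparisons reduce to the same test, namely that the two membership degrees sum to at least one.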



3.2 Query Evaluations 

We address how to extract desired information from our extended relational
database. A query is performed by using select operations in relational
databases, where it is evaluated to what degree a data value is compatible with
a select condition $P$. $P$ consists of elementary formulas and the logical
operators and ($\wedge$), or ($\vee$), and not ($\neg$). A typical elementary
formula is "$A_k$ is $m$", where $m$ is a predicate.

The compatibility degree of a tuple value $t[A]$ with a select condition $P$
is expressed by a pair of degrees $Nec(P|t[A])$ and $Pos(P|t[A])$ that denote
to what extent the tuple value is compatible with $P$ in necessity and in
possibility, respectively. Suppose that $t[A]$ is expressed by
$\langle \mathcal{W}_{t[A]}, \mathcal{R}_{t[A]}, \mathcal{V}_{t[A]} \rangle$
and $|\mathcal{W}_{t[A]}| = n$. Then

\[ Nec(P|t[A]) = T(\Box(P|t[A]))/n, \qquad Pos(P|t[A]) = T(\Diamond(P|t[A]))/n, \]

where $T(\Box(P|t[A]))$ and $T(\Diamond(P|t[A]))$ are evaluated by the same
method as addressed in the previous subsection. For a negative select condition
$\neg P$,

\[ Nec(\neg P|t[A]) = 1 - Pos(P|t[A]), \qquad Pos(\neg P|t[A]) = 1 - Nec(P|t[A]). \]

The results of select operations are database relations having the same
structure as the original ones. A membership attribute value for a tuple $t$ in
a derived relation is

\[ (Nec(P|t[A]) \times t[\mu_\Box],\ Pos(P|t[A]) \times t[\mu_\Diamond]), \]

where $(t[\mu_\Box], t[\mu_\Diamond])$ is a membership attribute value in the
original relation and $\times$ is the arithmetic product, because the
compatibility degree is the number ratio of possible worlds that are compatible
with $P$.

Queries can be classified into atomic ones and compound ones. Atomic queries
are those whose select condition is expressed by an elementary formula or its
negation. We show an example of an atomic query.

Example 4 

The entity addressed in Example 3 has another attribute $A_2$. The values of
the attribute obtained from four sources are 30, which comes from the first to
the third sources, and 20, which comes from the fourth. The first and the
second sources have nothing to do with the others. The third and the fourth
sources accept each other. Suppose a select condition $P$: $A_2$ is $\{30, 40\}$;
namely, $A_2$ is equal to 30 or 40. We calculate a compatibility degree of the
tuple value $t[A]$ with the select condition. The first and the second possible
worlds are certainly compatible with the select condition. All possible worlds
are possibly compatible with the condition. So, we get $Nec(P|t[A]) = 2/4$ and
$Pos(P|t[A]) = 4/4$. Considering the membership attribute value (2/5, 3/5) in
the original relation, we obtain a membership attribute value (1/5, 3/5) from
(2/4 x 2/5, 4/4 x 3/5).
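
Example 4 can be replayed in code. This is a sketch under the same assumptions as before (0/1 accessibility matrix with every world accessing itself, hypothetical helper `nec_pos`); it also checks the negation rule by evaluating the negated condition directly.

```python
from fractions import Fraction


def nec_pos(R, compatible):
    """(Nec, Pos): ratio of worlds compatible in necessity / possibility."""
    n = len(R)
    nec = sum(all(compatible[j] for j in range(n) if R[i][j]) for i in range(n))
    pos = sum(any(compatible[j] for j in range(n) if R[i][j]) for i in range(n))
    return Fraction(nec, n), Fraction(pos, n)


# Attribute A2: sources 1 and 2 are isolated, sources 3 and 4 accept each other.
R = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
values = [30, 30, 30, 20]

P = {30, 40}                      # select condition: A2 is {30, 40}
sat = [v in P for v in values]
nec, pos = nec_pos(R, sat)
print(nec, pos)                   # 1/2 1

# Membership value of the tuple in the original relation (Example 3).
mu_nec, mu_pos = Fraction(2, 5), Fraction(3, 5)
derived = (nec * mu_nec, pos * mu_pos)
print(*derived)                   # 1/5 3/5

# Negation rule: Nec(not P) = 1 - Pos(P), Pos(not P) = 1 - Nec(P).
neg_nec, neg_pos = nec_pos(R, [not s for s in sat])
print(neg_nec == 1 - pos, neg_pos == 1 - nec)  # True True
```

Exact fractions are used instead of floats so that the printed degrees can be compared literally with the values in the text.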

Compound queries have a select condition $P$ containing the logical operators
and ($\wedge$) and/or or ($\vee$). We suppose that the select condition $P$ is
composed of elementary formulas $f_1$ and $f_2$ in a compound query. If $f_1$
and $f_2$ are noninteractive, in other words, if they do not contain common
attributes at all, the compound query can be calculated by evaluations of two
atomic queries; namely,

\[ Nec(f_1 \wedge f_2|t[A]) = Nec(f_1|t[A]) \times Nec(f_2|t[A]), \]
\[ Pos(f_1 \wedge f_2|t[A]) \leq Pos(f_1|t[A]) \times Pos(f_2|t[A]), \]
\[ Nec(f_1 \vee f_2|t[A]) \geq Nec(f_1|t[A]) + Nec(f_2|t[A]) - Nec(f_1|t[A]) \times Nec(f_2|t[A]), \]
\[ Pos(f_1 \vee f_2|t[A]) = Pos(f_1|t[A]) + Pos(f_2|t[A]) - Pos(f_1|t[A]) \times Pos(f_2|t[A]), \]

where the equality holds when $f_1$ and $f_2$ are noninteractive; namely, $f_1$
and $f_2$ do not contain common attributes.

When $f_1$ and $f_2$ are interactive, suppose that $f_1$ is "$A_i$ is $S_1$"
and $f_2$ is "$A_i$ is $S_2$". We calculate the compatibility degree
$Pos(f_1 \wedge f_2|t[A])$ in possibility by resetting $P$ to
"$A_i$ is $S_1 \cap S_2$" in the case of $P = f_1 \wedge f_2$, and the
compatibility degree $Nec(f_1 \vee f_2|t[A])$ in necessity by resetting $P$ to
"$A_i$ is $S_1 \cup S_2$" in the case of $P = f_1 \vee f_2$.
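
The two cases of compound queries can be illustrated as follows. The sketch assumes that the equalities are used for noninteractive formulas and that interactive formulas on the same attribute are handled by merging their value sets; the degrees for $f_1$ and $f_2$ are those of Examples 3 and 4, while the sets $S_1$ and $S_2$ and all function names are hypothetical.

```python
from fractions import Fraction


def nec_pos(R, compatible):
    """(Nec, Pos): ratio of worlds compatible in necessity / possibility."""
    n = len(R)
    nec = sum(all(compatible[j] for j in range(n) if R[i][j]) for i in range(n))
    pos = sum(any(compatible[j] for j in range(n) if R[i][j]) for i in range(n))
    return Fraction(nec, n), Fraction(pos, n)


def and_noninteractive(d1, d2):
    """(Nec, Pos) of f1 and f2 when they share no attribute."""
    (n1, p1), (n2, p2) = d1, d2
    return n1 * n2, p1 * p2


def or_noninteractive(d1, d2):
    """(Nec, Pos) of f1 or f2 when they share no attribute."""
    (n1, p1), (n2, p2) = d1, d2
    return n1 + n2 - n1 * n2, p1 + p2 - p1 * p2


# Noninteractive case: f1 on A1 (Example 3), f2 on A2 (Example 4).
f1 = (Fraction(2, 5), Fraction(3, 5))
f2 = (Fraction(1, 2), Fraction(1))
print(*and_noninteractive(f1, f2))  # 1/5 3/5
print(*or_noninteractive(f1, f2))   # 7/10 1

# Interactive case: both formulas constrain A1 (worlds of Example 2).
R1 = [[1, 0, 0, 0, 0],
      [0, 1, 1, 0, 0],
      [0, 1, 1, 0, 0],
      [0, 0, 0, 1, 1],
      [0, 0, 0, 0, 1]]
a1_values = ["a", "b", "c", "c", "d"]
S1, S2 = {"a", "b"}, {"b", "c"}     # hypothetical value sets

_, pos_and = nec_pos(R1, [v in (S1 & S2) for v in a1_values])
nec_or, _ = nec_pos(R1, [v in (S1 | S2) for v in a1_values])
print(pos_and, nec_or)              # 2/5 3/5
```

Only the two degrees that the text singles out for the interactive case are recomputed from the merged sets here.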

4 Concluding Remarks 

We have developed relational databases with plural sources of information under
a modal logic approach. Each source is expressed by a possible world where a
proposition holds. Furthermore, the relationships between sources are
considered by using an accessibility relation. An attribute value is the
triplet that consists of a set of possible worlds, an accessibility relation,
and a set of value assignment functions. In this approach all pieces of
information obtained for an entity are kept without fusing. As a result,
information can be updated easily. In addition, our approach can be extended to
databases with imperfect information [7]. Thus, our approach gives a
significant basis for handling data with plural sources of information.
Abstract. In this paper, we present an agent-based tool to discover new
research topics from the information available on the World Wide Web
(WWW). The agents use KAROKA (Keywords Association Rules Optimizer
Knobots Advisers). KAROKA is a model of discovery in text databases used
on the WWW. The WWW sources are converted to a highly structured
collection of text. Then, KAROKA tries to extract association rules,
regularities and useful information from the collection of text. The
KAROKA techniques described include information retrieval similarity
metrics for text, generation and pruning of keyword combinations, and
summary proposal of discovered information.



1 Introduction 

When a user explores a new domain, attempting to summarize the essence of 
an area previously unknown to the user, it is called information and knowledge 
discovery [1]. 

An Information Discovery Agent (Web Mining Agent) is an important kind of
information seeking agent that tries to carry out the previously mentioned
tasks for text documents with images or sounds on the Web.

Existing Information Discovery Agents are currently unable to produce rational
analyses of information retrieved from the World Wide Web by themselves,
because the Web is not a structured information source. These agents process
user queries and return high quality results that satisfy the user's
preferences, without bringing new ideas or unexpected interesting discoveries.
Intelligent Information Discovery Agents should have some mechanisms to
distinguish irrelevant data from unexpected interesting results.

Current systems focus on users' relevance feedback and preferences, as in
supervised-learning systems. Some systems use event detection and tracking,
described as unsupervised clustering algorithms, to search for high quality
information and extract the maximum knowledge. These methods do not bring
novelty, utility, and understandability to the results [5].

The domain we focus on is the discovery of research topics on "Data Mining
and Knowledge Discovery and Their Applications." We try to determine "What
are the possible and promising research topics in this research domain,
according to the information on the Web?"

We introduce KAROKA (Keywords Association Rules Optimizer Knobots
Advisers), a personalized system that proactively tries to discover information
from various distributed sources and presents it to the user in the form of a
digest. KAROKA uses a tool similar to "AltaVista Discovery" [2], Karnak
[3] or CiteSeer [4] for the exploration of the World Wide Web.

In section 2, we describe work related to our research. The KAROKA
model is explained in section 3. Section 4 details an example of KAROKA
use and experimental results on "data mining trends and forecasts." Section 5
summarizes the paper and describes our future work.



2 Related Works 

We describe here the status of the current research on information retrieval and 
information discovery agents which may have impact on our own research. 



2.1 Information Retrieval Agents 

Although our work is different from information retrieval, we use a search
engine in our system. We briefly survey the area here.

Search engines have been built in the past that focus on indexing the WWW,
thus providing an easier way of finding documents. A common problem of WWW
indexing engines is that they frequently return documents that are irrelevant
to what the user is interested in. Sometimes this is caused by a poor choice of
keywords by the user and sometimes by poor indexing of the documents by
the engine. Personalization of the system can form more exact queries.

Another problem of search engines is the processing of information available
on the Web. For any given query, there are often simply too many relevant
documents for the user to cope with the result efficiently. This work is beyond
the scope of search engine processing.

In the recent survey by Mladenic [6], personalized agents are developed
with machine learning (text-learning) or data-mining techniques. Text-learning
is applied to collected information to help users browse the Web or
improve their searches.

Information retrieval utilizes the weighted keyword vector as a general method
of document representation. Text-learning systems use feature selection
and classification methods to adapt to individual users for personalization.

The intelligent agents in information retrieval are now able to find and filter
important and useful information according to their users' preferences [6] [7].

Some intelligent agents are capable of comparing documents and advising on
their similarity in order to refine searches or web page recommendations as
much as possible [8] [9].
2.2 Agents for Information and Knowledge Discovery

By Discovery Agents we mean autonomous programs that carry out the discovery
of information from the Web. We do not extend the term to binary and image
databases. Our area of discovery is text or keywords.

The process of information discovery could be greatly assisted if there were
a way of making it personalized. For instance, a personalized agent could form
more exact queries.

Current Information Discovery Agents specializing in the WWW directly visit
the site of interest and analyze the document to find whether it has changed
and by how much, using the database where the URLs are stored. New and unknown
information sources are progressively detected and added to the database.

For example, "AltaVista Discovery" [2] and "Karnak" [3] aim to personalize
the information at the user level. "CiteSeer" [4] bases its discovery on groups
of users or clustering.

Ideal Discovery Agents are able to navigate, read, summarize, and again
surf the collection of text to form an abstraction of what a new research area
or domain is all about.

The methods used in discovery agents are the same as in data mining. Mining
functions are based on statistics, classification, correlation detection,
factorial analysis methods, and graphical representation. For example, text
clustering is used to create thematic overviews of text collections, and
co-citation analysis has been developed to find general topics within a
collection.

Our methods are related to "Text Data Mining" as described in [10]. These
include exploration strategies and hypothetical sequences of operations.

Until now, discovery agents have not suggested possible new research topics.
As the goal of our research is to discover new research topics from the Web, we
describe our methods and experiments in the following sections.



3 KAROKA Model and Architecture 

The objectives of KAROKA are to design agents that can process user queries in
domain specific research areas, collect WWW sources relevant to the queries,
extract keywords and rules from the retrieved sources, infer possible new
research topics in the domain, and present the results as a list of possible
new research topics in the domain. KAROKA (see figure 1) uses keywords
extracted from technical and project presentation articles available in
World Wide Web documents.

KAROKA does not search the WWW itself but instead launches multiple
agents that utilize existing indexing engines and perform a "meta-search" in
order to collect and discover information that is broadly of interest to the
user. The system then further analyzes the retrieved documents, building a
keywords database and a structured tree, such as indexes to the research topics
domain.

It uses information discovery agents to monitor frequently changing
information resources and update the database.

According to the research topic area, it generates and randomly combines
keywords of research sub-topics. Strategies and constraints for the selection
of new topics are based on classification techniques, association rules, and
verification on the WWW.

Fig. 1. KAROKA model and analysis: The user focuses on the research topics in
web pages found by a search engine according to a query in a specific domain
(in this figure, the domain is Computer Science / Artificial Intelligence /
Machine Learning / Knowledge Discovery and Data Mining). Documents are
retrieved from URLs and are structured into a research topics tree. Research
topics are composed of one or several keywords. Keywords are classified into
method keywords or application keywords. According to our strategies and
algorithms, new research topics are derived.



3.1 Preliminary Preparation 

Some research topics are already indexed and categorized hierarchically in the 
search engines via the Web. Those research topics are general and common. 
For example, the index of the research topics on Data Mining and Knowledge 
Discovery has many sub-topics. 

Computer Science>Artificial Intelligence>Machine Learning
Machine Learning>Knowledge Discovery>...
Data Mining>Application>...
Classification>...
Feature Selection>...
Classification>Decision Rules - Winnow - TFIDF - Naive Bayes - ...



This preliminary preparation can be carried out with a user agent. The
following tasks should be done first.

The user uses a search engine to find general 'research topics' in the research
domain he/she is interested in.

The user selects some URLs from the search results to start the discovery of
new research topics.

For each URL page, there is a list of contents. The user focuses on the content
'research topics or research areas' if it exists.

The user retrieves all documents in the found section and stores them on
his/her computer. If a document is in HTML format, it is transformed into
structured text, and the headers are treated as a special type of keyword.
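
The last task, turning an HTML page into structured text with headers kept apart, could look roughly like the sketch below. It uses Python's standard `html.parser` module; the class name, the choice of which tags count as headers, and the dictionary layout of the result are assumptions made for illustration, not a description of the KAROKA implementation.

```python
from html.parser import HTMLParser


class HeaderTextParser(HTMLParser):
    """Split a page into header text and body text."""

    HEADER_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}  # assumed set

    def __init__(self):
        super().__init__()
        self.headers = []
        self.body = []
        self._open_headers = []  # header tags currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADER_TAGS:
            self._open_headers.append(tag)

    def handle_endtag(self, tag):
        if self._open_headers and self._open_headers[-1] == tag:
            self._open_headers.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            target = self.headers if self._open_headers else self.body
            target.append(text)


def structure(html):
    """Return header and body text of an HTML document as two lists."""
    parser = HeaderTextParser()
    parser.feed(html)
    parser.close()
    return {"headers": parser.headers, "body": parser.body}


page = ("<html><head><title>Lab</title></head><body>"
        "<h2>Research topics</h2><p>Data mining</p></body></html>")
print(structure(page))
# {'headers': ['Lab', 'Research topics'], 'body': ['Data mining']}
```

Keeping header text in its own list makes it straightforward to give it a different weight from body text during keyword extraction.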



3.2 Keywords Extraction 

Each retrieved document for every URL is processed one by one. Documents can
have different formats, such as ASCII text, PDF or PostScript. With conversion
tools, a document is always converted into ASCII text before keyword
extraction.

One obvious method of extracting keywords is to take the keywords as the
authors defined them in the document. Some documents, such as articles or
technical papers, explicitly contain keywords. These explicit keywords are
treated with priority.

Documents without explicit keywords are processed with the bag-of-words
document representation [11]. By defining rules through training, we can
extract the important words in the bag of words. The rules serve to eliminate
the words with low frequency and high weight. These words are too common.
In HTML documents, the 'header' keywords and 'body' keywords are weighted
differently. The 'header' of some URLs may contain the 'Research topics'.
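
A bag-of-words scoring step with different header and body weights might be sketched as follows. This is a simplified stand-in, not the trained rules mentioned above: the stop-word list, the `header_weight` value, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a much larger one.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "on", "we"}


def bag_of_words(text):
    """Count lower-cased word tokens, ignoring stop words."""
    tokens = re.findall(r"[a-z][a-z0-9\-]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def weighted_keywords(header, body, header_weight=3.0, top_k=5):
    """Score words by frequency, counting header occurrences more heavily."""
    scores = Counter()
    for word, count in bag_of_words(body).items():
        scores[word] += count
    for word, count in bag_of_words(header).items():
        scores[word] += header_weight * count
    return [word for word, _ in scores.most_common(top_k)]


header = "Research topics: data mining"
body = "We study data mining and clustering of text. Clustering of web text is hard."
print(weighted_keywords(header, body, top_k=2))
```

With these toy inputs the two highest-scoring words are "data" and "mining", since they occur both in the header and in the body.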
3.3 Knobots Adviser

The Knobots advisers are autonomous programs that collect the results of
queries submitted to Web search engines.

Each keyword generated by the association rules optimizer module is checked
on the Web. The first criterion for relevance is the occurrence and the rank of
the related URL given by the search engine, for instance within the first 10 or
100 matches.

The knobot then checks the URL to assess the quality of the page and
collects possible technical papers or reports.



3.4 Keywrords Association Rules Optimizer 

The collected keywords are processed to discover new topics. 

This module classifies the keywords as “method keywords” and “application 
keywords.” The “method keywords” are related to the research topics and their 
sub-topics generally already known in the research community. 

The “application keywords” are related to the other domains which the meth- 
ods are applied. 

First keywords selection: Each method keyword is checked with the Web for 
the occurrence, then they are sorted. 

The keywords with high frequency (occurrence) are eliminated. They are too 
common for the topics. 

Combining keywords: Remaining keywords from the first selection are com- 
bined two-by-two to obtain the trends of the research within these research topic 
keywords. 

We then check with the knobot adviser module the relevance of these generated 
combination of keywords in the Web. 



Second keywords selection: According to the relevance of the keywords in 
the Web, a second selection is necessary to eliminate again the high frequency 
keywords. The same method as in first selection is used here. 

At this stage, we observed the existence of research topics classes. The high 
frequency keywords class is belong to the known research topics. Low frequency 
keywords class may be new research topics or irrelevant keywords. 



Second keywords combination: We applied the first selection method to the
application keywords.

We added the application keywords to the two-keyword combinations resulting
from the previous second keywords selection. We then generated trends of research
topics based on three keywords: two method keywords and one application
keyword. A final check against the Web is performed to obtain the proposals of
new research topics.
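
The selection and combination stages described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names, the cut-off `max_hits` and the two application keywords are chosen for the example, and the occurrence counts are supplied as a plain dictionary, whereas in KAROKA they would be obtained from the search engine through the knobot.

```python
from itertools import combinations

def drop_common(hits, max_hits):
    """Keep keywords whose occurrence count is at most max_hits (assumed cut-off)."""
    return sorted(k for k, n in hits.items() if n <= max_hits)

def combine_pairs(keywords):
    """Combine the remaining method keywords two-by-two."""
    return list(combinations(keywords, 2))

def add_application(pairs, applications):
    """Build three-keyword topics: one application keyword plus two method keywords."""
    return [f"{app} and {m1} and {m2}" for app in applications for m1, m2 in pairs]

# Occurrence counts for a few method keywords (hypothetical input format).
method_hits = {"lattice traversal": 9, "dependence rules": 45,
               "attribute focusing": 37, "categorical data": 6277}

selected = drop_common(method_hits, max_hits=1000)   # drops "categorical data"
pairs = combine_pairs(selected)
topics = add_application(pairs, ["astrophysics", "pharmacy"])
print(topics[0])
```

In the described system every generated combination would be checked on the Web again before the next stage; that check is omitted in this sketch.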
4 Experiments



Our starting URLs are laboratories and universities in Europe, Asia
or North America with English pages (for example, ATT, IBM, Microsoft
Research, CMU, UCI, ORST, ...). For each URL we focused on the research topics
Data Mining and Knowledge Discovery.

KAROKA is then launched to extract research topic keywords and to try to
propose promising new research topics.

We describe here the values obtained in the experiment.

200 explicit keywords extracted from articles and technical papers on “data
mining and knowledge discovery” and 600 keywords from HTML documents
are retained.

Classifying the keywords as “method keywords” and “application keywords”,
eliminating common keywords using the term frequency, and selecting keywords
with the knobots gave 110 method keywords and 50 application keywords.

Combining the method keywords two-by-two to find new topics, then ranking
them by Web occurrence and eliminating, gave 316 research topics.
Combining the method keywords with the application keywords and selecting
returned 13450 research topics.

The top 200 of these research topics are proposed as the final results of the
KAROKA experiment.



method keyword results

attribute focusing, 37
back propagation, 5070
bayesian networks, 4750
bayesian statistics, 3460
c4.5, 3810
categorical data, 6277
closure properties, 1424
cognitive impairment, 8330
combining classifiers, 190
concept learning, 3490
datacube, 1467
deductive learning, 170
density estimation, 4369
dependence rules, 45
duplicate elimination, 598
ensemble learning, 160
exemplar-based learning, 122
expectation-maximization, 1220
experimental comparisons, 427
explanation-based learning, 1769
feature selection, 4200
indicator variables, 1107
inductive learning, 4353
instance-based learning, 829
lattice traversal, 9
»»»»» cut here »»»»»



research topics results (application keyword+two method keywords) 



»»»»» cut here »»»»»

astrophysics and lazy learning and tree-structured classifiers 20 
hiv and deductive learning and exemplar-based learning 20 
hiv and optimal classification and rule consistency 20 
mutations and concept learning and maximal hypergraph cliques 20 
pharmacy and lattice traversal and model uncertainty 20 
astrophysics and dependence rules and multi-strategy learning 20 
episodes and lattice traversal and statistical learning theory 20 
mutations and experimental comparisons and tree-structured classifiers 20 
episodes and first order decision trees and rule-based systems 20 
mutations and lattice traversal and model uncertainty 20 
episodes and duplicate elimination and lattice traversal 20 
pharmacy and density estimation and vcdimension 20
astrophysics and attribute focusing and lazy learning 20
hiv and categorical data and first order decision trees 20 
query and first order decision trees and rule learning 20 
pharmacy and perceptron and vcdimension 20 
pharmacy and lattice traversal and rule induction 20 
query and curse-of-dimensionality and vcdimension 20
episodes and vcdimension and winnow 20 
astrophysics and lattice traversal and rule induction 20 
drug resistance and lazy learning and montecarlo methods 19 
dementia and lattice traversal and statistical reasoning 19 
»»»»» cut here »»»»»



Our first experimental results show the new research topics. We are confident that
these topics are new and rare on the WWW. The user should judge their utility
and understandability.



5 Conclusions and Future Works 

In this paper, we presented a model for research topics discovery from the
World Wide Web. The model is based on the KAROKA system. KAROKA is
a personalized tool using keywords association rules and knobots. With KAROKA,
we have partially automated the discovery.

Our experimental results show the KAROKA system applied to discovering new
research topics in Data Mining and Knowledge Discovery.

At the current stage of the KAROKA program, the user must interpret the results
given by KAROKA as a support for finding his/her research topics.

In the future, we will refine the KAROKA program to be more precise and
flexible for any language.
Abstract. In the paper we investigate the problem of analysis of time
related information systems. We introduce the notion of temporal templates,
i.e. homogeneous patterns occurring in some periods. We show how to
generate temporal templates and time dependencies among them. We
also consider decision rules describing dependencies between some temporal
features of objects. Finally, we present how temporal templates can
be used to discover the behaviour of such temporal features.

Keywords: temporal template, temporal feature extraction, rough sets 

1 Introduction 

The intelligent analysis of data sets describing real life problems has become a very
important topic of current research in computer science. Because of the different kinds
of data sets, as well as the different types of problems they describe, there is
no universal methodology or algorithm to solve these problems. For example,
the analysis of a given data set (information system) may be completely different if
we define a time order on the set of objects described by this data set, because the
problem may be redefined to include time dependencies. Also, the expectation
of an analyst may be different for the same data set, according to the situation.
Let us consider a decision problem described by an information system (decision
table), where objects are ordered according to time. In one situation the analyst
may want to extract typical decision rules, e.g. "if a = v and c = w then
decision = [...]", but another time, information about how the change of a given
condition (attribute) influences the change of decision, e.g. "if $\Delta a$ = high positive
and $\Delta c$ = high negative then $\Delta$decision = neutral". A much more general problem
is to find, for a given property of a condition ("$\Delta a$ = positive" is an example of
the property positive change of condition a), temporal dependencies giving an idea
about periods of occurrence of this property, as well as temporal dependencies
between different properties. For example, we can discover that, if properties
"$\Delta a$ = positive" and "$\Delta c$ = negative" appear together at the same time and last
for some long period, then after a certain time another set of properties
will appear together (e.g. properties "[...] = neutral" and "$\Delta c$ = positive").

In the paper we introduce the notion of temporal template, i.e. homogeneous 
pattern occurring in time. We show how to extract several different temporal 
templates and how to discover dependencies among them. We also present a 
method of generation of more general decision rules, which contain information 
about, e.g., changes of values in some period. We show how temporal templates
can be used to track temporal behaviour of attribute properties. 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 420-427, 2001. 
2 Temporal templates

First, let us define the way we represent data sets. The basic notion is an
information system [8], which is defined as a pair $\mathbb{A} = (U, A)$, where $U$ is a non-empty,
finite set called the universe and $A$ is a non-empty, finite set of attributes. Each
$a \in A$ corresponds to a function $a : U \to V_a$, where $V_a$ is called the value set of $a$.
Elements of $U$ are called objects, situations or rows, interpreted as, e.g., cases,
states, patients, observations.

A special case of information system is a decision table $\mathbb{A} = (U, A \cup \{d\})$,
where $d \notin A$ is a distinguished attribute called decision. The elements of $A$ are
called conditional attributes (conditions).

Any expression of the form $(a \in V)$, where $a \in A \cup \{d\}$, $V \subseteq V_a$, we call a
descriptor. A descriptor $(a \in V)$ is a conditional descriptor iff $a \in A$, or a decision
descriptor iff $a \in \{d\}$.

We say that an object $x \in U$ matches a descriptor $(a \in V)$ iff $a(x) \in V$.

The notion of templates was intensively studied in the literature (see e.g. [1], [5],
[7]). For a given information system $\mathbb{A} = (U, A)$, by a generalized template we mean
a set of descriptors

$$T = \{(a \in V) : V \subseteq V_a\} \qquad (1)$$

such that, if $(a \in V) \in T$ and $(b \in W) \in T$, then we have $a \neq b$. An object
$x \in U$ matches a generalized template $T$ if it matches all descriptors of $T$. A
special case of generalized templates are templates with one-value descriptors,
i.e. of the form (a = c). Templates can be understood as patterns which determine
homogeneous subsets of an information system

$$T(\mathbb{A}) = \{x \in U : \forall_{(a \in V) \in T}\; a(x) \in V\} \qquad (2)$$

In many practical problems the time domain plays a very important role. In this
case, the set of objects is ordered according to time, $\mathbb{A} = (\{x_1, x_2, ..., x_n\}, A)$,
and we say that $t$ is the time of occurrence of object $x_t$, $1 \le t \le n$. From now on
by information system we mean one in the sense presented above. In such systems
one can consider templates that occur in some period. This kind of templates
we call temporal templates and define as

$$\mathbf{T} = (T, t_s, t_e), \quad 1 \le t_s \le t_e \le n \qquad (3)$$

We say that two temporal templates $\mathbf{T}_1 = (T_1, t_s, t_e)$ and $\mathbf{T}_2 = (T_2, t'_s, t'_e)$
are equal if $T_1 = T_2$.

Now, let us define two important properties of temporal templates. By the width
of a temporal template $\mathbf{T} = (T, t_s, t_e)$ we mean $width(\mathbf{T}) = t_e - t_s + 1$, i.e.
the length of the period in which $\mathbf{T}$ occurs. Please notice that not all objects
$x_{t_s}, x_{t_s+1}, ..., x_{t_e}$ have to match $T$, and intuitively the more objects match $T$ the
better. Thus, by the support of $\mathbf{T}$ we understand $supp(\mathbf{T}) = card(T(\mathbb{A}_{\mathbf{T}}))$, where
$\mathbb{A}_{\mathbf{T}} = (\{x_{t_s}, x_{t_s+1}, ..., x_{t_e}\}, A)$ is the information system determined by $\mathbf{T}$ and
$T(\mathbb{A}_{\mathbf{T}})$ is the set of objects of this system that match $T$. The quality of a temporal
template is a function that rewards width, support, and the number and precision of
descriptors.
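
The definitions of matching, width and support can be illustrated with a minimal sketch. The representation is an assumption made for the example: an object is a dictionary from attribute names to values, a generalized template is a dictionary from attribute names to sets of admissible values, and time indices are 1-based as in the text. The toy data are hypothetical.

```python
def matches(obj, template):
    """An object matches a generalized template if a(x) is in V for every descriptor (a in V)."""
    return all(obj[a] in values for a, values in template.items())

def width(t_s, t_e):
    """Length of the period in which the temporal template occurs."""
    return t_e - t_s + 1

def support(objects, template, t_s, t_e):
    """Number of objects among x_{t_s}, ..., x_{t_e} (1-based, inclusive) that match."""
    return sum(matches(x, template) for x in objects[t_s - 1:t_e])

# Hypothetical time-ordered information system with attributes a and c.
objects = [
    {"a": "p", "c": "v"},
    {"a": "u", "c": "v"},
    {"a": "u", "c": "w"},
    {"a": "u", "c": "v"},
    {"a": "q", "c": "v"},
]
template = {"a": {"u"}, "c": {"v"}}

w = width(2, 4)
supp = support(objects, template, 2, 4)
print(w, supp)
```

Here the third object lies inside the period but does not match the template, which is exactly the situation in which support is smaller than width.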

In Figure 1 we show two examples of temporal templates $\mathbf{T}_1 = (\{(a \in \{u\}), (c \in \{v\})\}, 2, 8)$
and $\mathbf{T}_2 = (\{(b \in \{x\}), (d \in \{y\})\}, 10, 13)$.
Fig. 1. An example of temporal templates.

3 Temporal templates generation 

In this section we present an algorithm that generates several temporal templates
which are disjoint according to time. The main idea is based on scanning the
information system with some time window and generating the best generalized
template within this window. There is a chance that after a slight shift of the time
window, the previously found template is also the best one in the new window. By
shifting the window we may discover the beginning ($x_{t_s}$) and the end ($x_{t_e}$) of the
area where a given template is optimal or close to optimal. Shifting the
time window through the whole information system generates a set of temporal
templates.

Let $\mathbb{A} = (\{x_1, x_2, ..., x_n\}, A)$ be an information system. By a time window
on $\mathbb{A}$ of size $s$ in point $t$ we understand an information system
$win_{\mathbb{A}}^{s}(t) = (\{x_t, x_{t+1}, ..., x_{t+s-1}\}, A)$, where $1 \le t$ and $t + s - 1 \le n$. In the process of
temporal rules generation both the size $s$ of a window and the number of windows used
are parameters that have to be tuned according to the type of data. Below we
present details of the algorithm.

Algorithm: Temporal templates generation
Input: Information system $\mathbb{A}$, size of time window size, length of the shift of
       time window step, quality threshold r
Output: Set of temporal templates

i := 1, T = NULL, t_s = 1
while i < n - size begin
    best = FindBestTemplate($win_{\mathbb{A}}^{size}(i)$)
    if best $\neq$ T then begin
        t_e = i
        if Quality((T, t_s, t_e)) $\geq$ r then
            output (T, t_s, t_e)
        t_s = i
        T = best
    end if
    i := i + step
end while
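
The listing above can be turned into a short runnable sketch. Several parts are assumptions made for illustration and are not taken from the paper: `find_best_template` is a toy heuristic (it is not one of the heuristics of [5], [7]), the `quality` function is supplied by the caller, and the initial NULL template is skipped when checking quality. The sketch follows the listing literally, so the template that is still open when the loop ends is not emitted.

```python
from collections import Counter

def find_best_template(window, min_share=0.8):
    """Toy heuristic (assumption): keep the descriptor (a = v) for every attribute
    whose most frequent value v covers at least min_share of the window."""
    template = set()
    for a in window[0]:
        value, count = Counter(x[a] for x in window).most_common(1)[0]
        if count >= min_share * len(window):
            template.add((a, value))
    return frozenset(template)

def generate_temporal_templates(objects, size, step, r, quality):
    """Shift a window of the given size over the time-ordered objects (1-based index i)
    and emit (template, t_s, t_e) whenever the best template changes."""
    n = len(objects)
    results = []
    i, current, t_s = 1, None, 1
    while i < n - size:
        best = find_best_template(objects[i - 1:i - 1 + size])
        if best != current:
            t_e = i
            # current is None only before the first window (NULL in the listing).
            if current is not None and quality(current, t_s, t_e) >= r:
                results.append((current, t_s, t_e))
            t_s = i
            current = best
        i += step
    return results

def quality(template, t_s, t_e):
    """Hypothetical quality measure: width times number of descriptors."""
    return (t_e - t_s + 1) * len(template)

# Hypothetical data: attribute a equals "u" for six time points, then "x" for six.
objects = [{"a": "u"}] * 6 + [{"a": "x"}] * 6
templates = generate_temporal_templates(objects, size=3, step=1, r=3, quality=quality)
print(templates)
```

With these toy data the windows that straddle the change from "u" to "x" yield an empty template, whose quality is zero under the assumed measure, so only the first homogeneous period is reported.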
Fig. 2. A series of temporal templates.
At each run of the loop the algorithm shifts the time window. The task
of the subroutine FindBestTemplate($\mathbb{A}$) is to return the optimal (or semi-optimal)
template for $\mathbb{A}$. Because the problem of optimal template generation is NP-hard
(see e.g. [5]), an approximation algorithm should be considered for this
procedure. Several very fast heuristics for template generation can be found
in [5], [7].

The input parameter r is used to filter out temporal templates of low qua- 
lity. The definition of template quality may be different and it depends on the 
formulation of a problem. 



4 Discovery of dependencies between temporal templates 

Generation of series of temporal templates defines a very interesting problem 
related to discovery of dependencies among them. Several templates may occur 
many times and it can happen, that one template is always (or almost always) 
followed by another one. Generally one set of templates may imply occurrence 
of other templates and extraction of such information may be very useful. Let 
us consider the situation, that we are observing occurrence of some temporal 
template in current data. On the basis of template dependencies we predict 
occurrence of another template in the near future. In this section we present a
simple method that allows us to discover such template dependencies.

Let us observe that the process of temporal templates generation results in
a series of templates. On the time axis each template has its time of occurrence and
width. In Figure 2 we have an example series of four temporal templates A, B,
C, D. From such a series we want to extract three kinds of dependencies. First,
the dependencies between templates, e.g. decision rules of the form "if template in
time t-3 is B and template in time t-1 is A then template in time t is D".
Second, if we have such a rule, then on the basis of the widths of templates B and A
we want to predict the width of D. The third kind of information we need is the
time of occurrence (0) of template D.

Now, let us focus on the problem of decision rules generation, that map 
dependencies between templates. Templates can be treated as events and the 
problem may be reformulated in terms of frequent episodes detection in time 
series of events. This problem is investigated in e.g. [4], however, the difference 
is, that here events are not points on time axis (as in [4]), but intervals of different 
length. Thus, we propose another method which takes advantage of rough set 
theory. 

The idea is based on construction of a decision table from time series of 
templates and further computation of decision rules. The number of condition 
attributes n is a parameter of this method and it reflects how many steps in 
past we want to consider. The objects of this table we construct from the series of
templates - one object is a consecutive sequence of n + 1 template labels. For







424 P. Synak 



example, if our template series is one presented in Figure 2, i.e. A, B, C, A, B, 
D, A, B, D and n = 2 we obtain the decision table as in Table 1 . 
Table 1. Decision table constructed from a sequence of temporal templates

From this table we can generate decision rules using, e.g., rough set methods
(see [2], [3], [8]). In our example we obtain, among others, the following rules:
if $T_{t-1}$ = A then $T_t$ = B
if $T_{t-2}$ = B then $T_t$ = A
if $T_{t-2}$ = A and $T_{t-1}$ = B then $T_t$ = D
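
The construction of the decision table from the series A, B, C, A, B, D, A, B, D with n = 2 can be reproduced with a small sketch. The helper names and the way a rule is encoded (a mapping from column index to required label, with column 0 the oldest template) are assumptions made for the example; no rough set rule induction is performed, the code only counts how often a given rule is matched and satisfied.

```python
def decision_table(series, n):
    """Each object is a consecutive run of n + 1 template labels:
    n condition values followed by one decision value."""
    return [tuple(series[i:i + n + 1]) for i in range(len(series) - n)]

def rule_stats(table, conditions, decision):
    """conditions maps a column index (0 = oldest) to a required label.
    Returns (objects matching the conditions, those that also have the decision)."""
    matching = [row for row in table
                if all(row[i] == v for i, v in conditions.items())]
    return len(matching), sum(row[-1] == decision for row in matching)

series = list("ABCABDABD")
table = decision_table(series, n=2)

stats_1 = rule_stats(table, {1: "A"}, "B")           # if T_{t-1} = A then T_t = B
stats_2 = rule_stats(table, {0: "B"}, "A")           # if T_{t-2} = B then T_t = A
stats_3 = rule_stats(table, {0: "A", 1: "B"}, "D")   # if T_{t-2} = A and T_{t-1} = B then T_t = D
print(len(table), stats_1, stats_2, stats_3)
```

Under this reading of the series the table has seven objects. The conditions of the third rule are met by three objects, of which two (the fourth and the seventh) have decision D, while the first object has decision C.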

Once we have a set of decision rules we can compute, for each rule, how the
widths of the predecessor templates of the rule determine the width of the successor
template. For a given rule, we construct a new decision table with condition attributes
corresponding to the widths of the predecessor templates of the rule and a decision
attribute derived from the width of the successor template. As condition attributes we
also consider the lengths of gaps on the time axis between consecutive templates. The
set of objects is created on the basis of all objects of the input decision table
matching the rule.

Suppose we consider the decision rule if $T_{t-2}$ = A and $T_{t-1}$ = B then $T_t$ = D.
There are two objects, $x_4$ and $x_7$, matching this rule. We check widths and gaps
between templates represented by these objects and create a decision table as
in Table 2a. Before further processing, this table should
be scaled to contain more general values (Table 2b). From such a table we can
compute decision rules expressing widths of templates.




Table 2. Decision tables describing gaps between and widths of templates generated 
for sample decision rule. 

If we already have the information about which template is going to appear 
next and what is its expected width, what we still need to know is the estimated 
time of its appearance. This information we can generate in an analogous way as 
in case of width. The difference is, that, when constructing decision table, we take 
as decision attribute not the template, which is in the successor of the rule, but 
the length of the gap between this template and last predecessor template (see 
Table 2cd). Finally, we compute rules expressing time of template occurrence. 
5 Temporal features and decision rules

In this section we investigate the problem of generating decision rules that
contain new features describing the behaviour of objects in time. We assume that the
input data are numerical, so we can speak about the degree of change of an attribute
value. An example of such a rule is "if change of attribute a is positive and change
of attribute c is negative then change of decision is negative". Decision rules of
this kind are very useful for analysis of time related information systems. We also
show how to compute templates built from descriptors of the form ($\Delta a$ = positive),
which can be further used to generate more general association rules.

In our method, first, we construct an information system which is a result of 
scanning of input data with a time window of some size. The size of a window 
is the parameter and can be tuned up according to type of data. When looking 
at the data through a time window we observe a history of changes of attribute 
values within some period (which is related to the size of the window). On 
the basis of this history we can construct new features describing behaviour 
of attributes in time, e.g. characteristics of plots of values, information about 
trends of changes. In our method let us focus on one example of such a temporal 
feature, which is the change of value within a time window. 

Let $\mathbb{A} = (\{x_1, x_2, ..., x_n\}, \{a_1, ..., a_m\})$ be an information system and let us
consider time windows of size $s$ on $\mathbb{A}$. We construct a new information system
$\mathbb{A}_s = (\{y_1, y_2, ..., y_{n-s+1}\}, \{\Delta a_1, ..., \Delta a_m\})$ in the following way:

$$\Delta a_i(y_j) = a_i(x_{j+s-1}) - a_i(x_j). \qquad (4)$$

In Table 3a we have a sample information system, which after scanning with
a time window of size s = 3 becomes the one in Table 3b.




Table 3. (a) Sample information system, (b) after scanning with a time window of 
size 3, (c) scaled, (d) scaled and containing new attributes (levels). 
The obtained information system should be scaled next, so that the results of
the analysis can be more general. In our example we use a three-value scale, which
describes the degree of change: negative, neutral or positive (see Table 3c).
The obtained information system is the basis for our further analysis.
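
The window scan and the subsequent scaling can be sketched as follows. The data are hypothetical (they are not the contents of Table 3), and the dead zone `eps` that separates neutral from positive or negative change is an assumed parameter, since the text does not state how the three-value scale is defined.

```python
def delta_system(rows, s):
    """rows: time-ordered list of numeric tuples (one tuple per object x_j).
    Returns y_1, ..., y_{n-s+1}, where each value is the attribute value at the
    end of the window of size s minus the value at its beginning."""
    return [tuple(end - start for start, end in zip(rows[j], rows[j + s - 1]))
            for j in range(len(rows) - s + 1)]

def scale(value, eps=1.0):
    """Three-value scale; eps is an assumed dead zone around zero."""
    if value > eps:
        return "positive"
    if value < -eps:
        return "negative"
    return "neutral"

# Hypothetical system with two numeric attributes and four time points.
rows = [(1, 10), (2, 9), (4, 9), (4, 5)]

deltas = delta_system(rows, s=3)
scaled = [tuple(scale(v) for v in y) for y in deltas]
print(deltas, scaled)
```

A window of size 3 over four objects gives two new objects, in line with the count n - s + 1.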

First, let us consider templates computed for such a system (see e.g. [7], [5]).
They are built from descriptors of the form ($\Delta a = v$) (or ($\Delta a \in \{v_1, ...\}$) in the more
general case), and long templates with large support may contain very useful
knowledge about higher-order dependencies between attributes. These templates
can be used to generate approximate association rules (see [6]) built from
higher-order descriptors. An example of templates for Table 3c is
$T_1 = \{(\Delta a_1 = neutral), (\Delta a_{?} = positive)\}$

$T_2 = \{(\Delta a_2 = neutral), (\Delta d = negative)\}$

Now, suppose the decision attribute is defined and we consider a decision
table $\mathbb{A} = (\{x_1, x_2, ..., x_n\}, \{a_1, ..., a_m\} \cup \{d\})$ describing some decision problem.
Using the method described above we can generate a decision table built from
information about attribute changes (including the decision), for which we can compute
decision rules. The obtained rules contain knowledge about how value changes of
conditional attributes determine the change of decision values.

For example, if we consider the information system presented in Table 3a as
a decision table, with the last attribute being the decision, after processing we obtain
Table 3c. From this table we can extract, e.g., the following rules:
if $\Delta a_2$ = negative then $\Delta d$ = positive
if $\Delta a_1$ = neutral and $\Delta a_2$ = neutral then $\Delta d$ = negative
Another extension of this method is to include, in the final decision table,
more conditional attributes describing, e.g., levels of attribute values. It can
happen that the degree of change of the decision attribute depends not only on the
degrees of change of conditional attributes, but also on the levels of values. In Table 3d
we have an example of such attributes, which are scaled into three classes: low,
medium, high. From this table we extract, among others, the following rules:
if $\Delta a_2$ = neutral and $a_2$ = low then $\Delta d$ = negative
if $\Delta a_2$ = negative and $a_2$ = high then $\Delta d$ = positive
The first rule, for example, can be interpreted as "if the value of $a_2$ isn't changing
much (in some period) when it is low, then the value of decision is decreasing".

6 Behaviour of temporal features in time 

In the previous section we showed how new features, that express behaviour of 
attributes in a time window, can be extracted from an information system. There 
can be several features considered - as an example we took a difference between 
value at the end and the beginning of time window. Now, let us investigate the 
problem of temporal behaviour discovery of a group of features. 

Suppose we consider a feature "[...] grows fast" discovered from a time related
information system. Several different problems related to this feature can be
considered. The basic one is to discover the periods in which this feature holds.
Besides, we would like to know what other properties (e.g. "$a_3$ behaves
stably") hold at the same time. Another question is what are the symptoms
that this feature is about to finish and what features are going to appear next.
We believe that temporal templates generation is a tool which helps to answer
the above questions. First, we analyze the information system using the time window
method and construct a new system built from temporal features. Then, we
compute temporal templates and generate dependencies among them. One can 
notice, that temporal template, for so processed information system, contains 
information about a set of features that hold at the same period, as well as 
about beginning and end time of this period. Analysis of dependencies between 
temporal templates gives the idea about what new set of features may appear 
after a current one. 

7 Summary 

We claim that the notion of temporal templates can be very useful for analysis
of time related information systems. Investigation of dependencies between
temporal templates gives an idea of how the knowledge hidden in data changes
in time. A very important topic is the extraction of new features from data,
describing temporal properties of the data. Finally, temporal templates can be used
to check how these features behave in time.
Abstract. We present a modification of a simple incremental procedure
maintaining the set of all current reduct rules. It reduces searching to 
the part of the rule space limited by a dynamic monotonic constraint. 
Efficiency problems and their solutions for the class of coverage based 
constraints are discussed and an illustrative example is provided. 
Keywords: rough sets, machine learning, incremental learning, decision 
algorithms. 



1 Introduction 

In recent years rough sets have been intensively studied as a method for approximate
concept synthesis from data tables. Many data sources have a dynamic character
and their size keeps increasing. In order to maintain the validity of knowledge
extracted from dynamically changing data one should develop incremental learning
strategies.

Incremental learning has already been widely studied in machine learning;
for an exhaustive overview of these methods the reader is referred e.g.
to [1] and [3]. This paper examines the problem in the setting of rough sets
introduced by Pawlak [5]. Different incremental algorithms maintaining reducts
have been proposed, e.g. [4], [7], and experimental results comparing nonincremental and
incremental methods for reduct generation may be found in [9].

The subject of the paper is an incremental method maintaining a set of reduct
rules. Shan and Ziarko [7] described an algorithm generating all reduct rules.
This paper presents a more practical version of it, based on the notion of a dynamic
monotonic constraint that reduces the size of the rule space to be searched. The
idea of searching for rules satisfying user requirements has already been used in
the nonincremental approach, e.g. [2], [8]. Different properties of the proposed
solution and methods of accelerating it are described, and an experimental example that
demonstrates potential advantages of the constraint based approach is provided.

2 Classification Rules and Constraints 

We denote a finite set of binary attributes by $A$ and a finite set of decisions by $V$.
The domain of all objects is defined by $U = \{0, 1\}^{|A|}$. The input of an incremental
algorithm is a finite sequence of pairs $(u_i, d_i)$ called a sample, where $u_i \in U$ is an
object and $d_i \in V$ is a decision for $u_i$. The notion of a sample corresponds to the
notion of a decision table [5] in the nonincremental approach. For a given sample $s$
we denote the set of all examples from $s$ with a decision $d$ by $Class_s(d)$.

A classification rule is an implication $\alpha \Rightarrow d$ where $\alpha$ is a conjunction of
literals of attributes from $A$ and $d \in V$. The support of a sample $s$ for a
conjunction $\alpha$ is defined by $[\alpha]_s = \{(u, d) \in s : u \text{ satisfies } \alpha\}$ and for a rule $\alpha \Rightarrow d$
is defined by $[\alpha \Rightarrow d]_s = [\alpha]_s \cap Class_s(d)$.

A rule $\alpha \Rightarrow d$ is certain for a sample $s$ if for each pair $(u_i, d_i)$ in $s$ such that
$u_i$ satisfies $\alpha$ the decisions are equal, $d_i = d$. A certain rule $\alpha \Rightarrow d$ is a reduct
rule if $\alpha$ is a minimal conjunction, in the sense of literal set inclusion, among
all conjunctions occurring on the left-hand side of a certain rule with the same
decision $d$. The set of all reduct rules with the decision $d$ for a sample $s$ is denoted
by $RedRul_s(d)$. We use two measures for rules: confidence and coverage [2], [5],
[6], [8]:

$$confidence_s(\alpha \Rightarrow d) = \begin{cases} 0 & \text{if } [\alpha]_s = \emptyset \\ \dfrac{\|[\alpha \Rightarrow d]_s\|}{\|[\alpha]_s\|} & \text{if } [\alpha]_s \neq \emptyset \end{cases}$$

$$coverage_s(\alpha \Rightarrow d) = \frac{\|[\alpha \Rightarrow d]_s\|}{\|Class_s(d)\|}$$
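
The two measures can be computed directly from a sample, as in the sketch below. The representation is an assumption made for the example: an object is a dictionary from attribute names to 0 or 1, a conjunction is a dictionary of required attribute values, and a sample is a list of (object, decision) pairs. Returning 0 for the coverage of an empty decision class is also an assumption, since that case is not covered by the formulas as reconstructed here.

```python
def satisfies(u, alpha):
    """u satisfies the conjunction alpha if it agrees with every literal."""
    return all(u[a] == v for a, v in alpha.items())

def confidence(sample, alpha, d):
    """Share of the examples covered by alpha that have decision d (0 if none is covered)."""
    covered = [di for u, di in sample if satisfies(u, alpha)]
    if not covered:
        return 0.0
    return sum(di == d for di in covered) / len(covered)

def coverage(sample, alpha, d):
    """Share of the examples with decision d that are covered by alpha."""
    in_class = [u for u, di in sample if di == d]
    if not in_class:
        return 0.0  # assumed convention for an empty class
    return sum(satisfies(u, alpha) for u in in_class) / len(in_class)

# Hypothetical sample over binary attributes a and b.
sample = [
    ({"a": 1, "b": 0}, "yes"),
    ({"a": 1, "b": 1}, "yes"),
    ({"a": 0, "b": 1}, "no"),
    ({"a": 1, "b": 1}, "no"),
]

conf = confidence(sample, {"a": 1}, "yes")
cov = coverage(sample, {"a": 1}, "yes")
print(conf, cov)
```

In this toy sample the rule with body a = 1 covers the whole class "yes" but is not certain, because one covered example has the other decision.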

Usually the set of all reduct rules is very large and only a small subset, which
can be described by a monotonic constraint, is relevant. A monotonic constraint
is a set of rules $C$ such that if a rule $\alpha \Rightarrow d$ belongs to $C$ then for each
$B \subseteq Literals(\alpha)$ the rule $\bigwedge B \Rightarrow d$ also belongs to $C$. We restrict the space of reduct
rules to the part bounded by $C$: $RedRul_s^C(d) = C \cap RedRul_s(d)$. Throughout the paper,
somewhat informally, we denote the description of a monotonic constraint and
the set of rules defined by the monotonic constraint with the same symbol $C$.

In the next sections we focus our attention on two types of monotonic
constraint: the first one, $RedRul_s^{coverage > \theta}(d)$, is based on a fixed coverage threshold
$\theta \in [0, 1]$:

$$\{r \in RedRul_s(d) : coverage_s(r) > \theta\}$$

and the other one, $RedRul_s^{best-k}(d)$, always includes the set of exactly $k$ best
reduct rules:

$$\{r \in RedRul_s(d) : \|\{r' \in RedRul_s(d) : coverage_s(r') > coverage_s(r)\}\| < k\}$$
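
Both constraint types can be sketched as filters over a set of rules whose coverage is already known. The rule names and coverage values below are hypothetical, and the comparison operators follow the formulas as they are read here from the damaged source, so they should be treated as an assumption.

```python
def above_threshold(cov, theta):
    """Rules whose coverage exceeds the fixed threshold theta."""
    return {r for r, c in cov.items() if c > theta}

def best_k(cov, k):
    """Rules for which fewer than k rules have strictly greater coverage."""
    return {r for r, c in cov.items()
            if sum(other > c for other in cov.values()) < k}

# Hypothetical coverage values of four reduct rules for one decision.
cov = {"r1": 0.5, "r2": 0.3, "r3": 0.3, "r4": 0.1}

kept_by_threshold = above_threshold(cov, theta=0.3)
kept_by_rank = best_k(cov, k=2)
print(sorted(kept_by_threshold), sorted(kept_by_rank))
```

Under this reading of the second formula, rules with equal coverage are kept together, so the result can contain more than k rules when there are ties.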

3 Incremental Constraint Based Algorithm 

The algorithm [7] computing all reduct rules starts with the set of the most 
general rules one for each decision class and after each new example is added it 
extends each rule that is inconsistent with the example by adding the literals 
excluding the example. 
Since the space of rules is usually too large to search for all reduct rules,
we propose a modified version of the incremental algorithm using a dynamic
monotonic constraint that may change after each new example is added. The
algorithm limits the set of maintained reduct rules to rules satisfying the
constraint. Let $C$ denote the considered monotonic constraint. During computation
the algorithm always maintains the following sets: $s$, the set of training
examples; $Rules(d)$, the set of reduct rules with the decision $d$; $CCand(d)$, the
set of candidates for reduct rules with the decision $d$ satisfying the constraint $C$;
and $nonCCand(d)$, the set of candidates for reduct rules with the decision $d$
not satisfying $C$.

As in [7], the algorithm starts with the set of the most general rules, one for
each decision class, and for each new example it executes the procedure learn. The
difference is that the constraint based algorithm extends candidates only from
the sets CCand, leaving the sets nonCCand unchanged:

Algorithm 1 learn(u, d)
s := s + (u, d);
update the constraint $C$;
for each $d'$ do
    step 1:
        move all rules $r \in Rules(d')$ such that $r \notin C$ to $nonCCand(d')$;
        move all rules $r \in Rules(d')$ inconsistent with $(u, d)$ to $CCand(d')$;
        move all certain rules $r \in nonCCand(d')$ such that $r \in C$ to $Rules(d')$;
        move all rules $r \in nonCCand(d')$ such that $r \in C$ to $CCand(d')$;
    step 2:
        while $CCand(d') \neq \emptyset$ do
            remove an arbitrary rule $\alpha \Rightarrow d'$ from $CCand(d')$;
            find an example $(u'', d'')$ inconsistent with the rule $\alpha \Rightarrow d'$;
            for each attribute $a \in A \setminus Attributes(\alpha)$ do
                $l$ := literal for $a$ which excludes $u''$;
                if $\alpha \wedge l \Rightarrow d'$ is not subsumed
                by another rule from $Rules(d') \cup CCand(d') \cup nonCCand(d')$ then
                    if $\alpha \wedge l \Rightarrow d' \notin C$ then $nonCCand(d') := nonCCand(d') \cup \{\alpha \wedge l \Rightarrow d'\}$
                    else if $\alpha \wedge l \Rightarrow d'$ is certain then $Rules(d') := Rules(d') \cup \{\alpha \wedge l \Rightarrow d'\}$
                    else $CCand(d') := CCand(d') \cup \{\alpha \wedge l \Rightarrow d'\}$;

At the beginning of the procedure learn(u, d) the sets $Rules(d')$ are assumed
to contain all reduct rules satisfying the constraint $C$ and $nonCCand(d')$ are
assumed to contain all rules generated so far that do not satisfy $C$, both according
to the sample $s$ before adding the new example $(u, d)$. The sets $CCand(d')$ should
be empty.

In step 1 the procedure moves rules according to changes in the sample
$s$ and the constraint $C$: reduct rules for the previous sample may be inconsistent
with the new example $(u, d)$, and the modified constraint may both include new
candidate and reduct rules and exclude previously covered reduct rules. The time
needed for this step may vary significantly depending on the constraint used.
For the constraint coverage > 0, migration is possible only for rules that cover the new
object $u$; for constraints with a positive coverage threshold, other rules with
the decision $d$ can migrate; and for best-k constraints, checking constraint
satisfiability becomes much more complex. In the last case a good solution is to
assume the ranking based on the current set of reduct rules and to perform step 2,
correcting the ranking every time a new reduct rule is found.

In step 2 the procedure extends all candidates satisfying $C$. Candidates
that were previously in $nonCCand(d')$ may be inconsistent with any example in
the sample $s$, not necessarily with the last one $(u, d)$. Therefore the
procedure must search the sample $s$ for an inconsistent example. To avoid
searching the whole sample for each candidate, the procedure may assign to each
extended rule $\alpha \land l \Rightarrow d'$ the position in $s$ where an
inconsistent example for the previous rule $\alpha \Rightarrow d'$ was found,
and continue searching from this place. After an inconsistent example is found,
the candidate is extended with all literals excluding $u''$. The next
time-consuming operation is subsumption checking. If an extension is not
subsumed by another rule it is directed to the appropriate set; otherwise
it is removed.
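
The routing of an extended rule described above can be illustrated by the following minimal sketch. It is not the original listing: the `Rule` class, the representation of a premise as a set of literals, and the callables `satisfies_c` and `is_certain` are hypothetical names introduced only for illustration, and subsumption is assumed to mean "same decision, premise literals form a subset".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    literals: frozenset  # premise: a conjunction of literals (assumed encoding)
    decision: str        # the decision d' predicted by the rule


def subsumes(general: Rule, specific: Rule) -> bool:
    """Assumed notion of subsumption: same decision and a subset of the literals."""
    return general.decision == specific.decision and general.literals <= specific.literals


def route_extension(rule, rules, ccand, nonccand, satisfies_c, is_certain):
    """Send an extended rule to Rules, CCand or nonCCand, or drop it if subsumed."""
    if any(subsumes(other, rule) for other in rules | ccand | nonccand):
        return "removed"
    if not satisfies_c(rule):      # rule does not satisfy the constraint C
        nonccand.add(rule)
        return "nonCCand"
    if is_certain(rule):           # rule is consistent with the sample
        rules.add(rule)
        return "Rules"
    ccand.add(rule)                # satisfies C but still needs extending
    return "CCand"
```

Keeping the three sets as plain Python sets makes the subsumption scan linear in the number of stored rules, which mirrors the remark that subsumption checking is one of the time-consuming operations.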

Theorem 1. At the end of the procedure learn the union $\bigcup_{d} Rules(d)$,
taken over all decision values $d$, is always equal to the set of all reduct
rules satisfying the constraint $C$ for the sample $s$.

4 Improving Efficiency 

One of the properties of the algorithm presented in the previous section is that it
never reduces rules. Generating more and more new rules without any reduction
prolongs subsumption checking and eventually exhausts memory. To avoid this
problem the following solution may be used. Every time a rule
$\alpha \Rightarrow d'$ is added to the set $nonCCand(d')$ it is also reduced as
much as possible:

Algorithm 2 reduce($\alpha \Rightarrow d'$)
reduce the rule $\alpha \Rightarrow d'$ to $\beta \Rightarrow d'$
where $\beta$ is any minimal conjunction subsuming $\alpha$ such that $\beta \Rightarrow d' \notin C$;

The presented improvement applies to constraints that have the "shrinking"
property, which means that new examples may lead to excluding a rule from
a constraint. An example of a "shrinking" constraint is $coverage > \theta$ for
any $\theta > 0$, whereas the constraint $coverage > 0$ does not have this
property.

However, this modification brings another undesirable phenomenon affecting
efficiency, namely "shimmering" of rules: a single rule may be generated and
reduced many times while the constraint is changing dynamically, and repeated
computation of rule parameters significantly slows performance. We present two
methods to deal with this problem.

The first one consists in maintaining two buffers: $BufExt$ saves rules for
which the extending operation was already performed and $BufRed$ saves rules
that were reduced. The buffers are usually too limited to keep all rules that
appeared in the process of learning. Therefore a certain measure is applied to
estimate which rules are most likely to be reused in the near future.
For coverage based constraints, coverage is a good measure. The following
procedure saveExtended is executed each time a rule is extended:

Algorithm 3 saveExtended($r$)

if the buffer $BufExt$ is not full then add $r$ to $BufExt$;
else if $coverage_s(r) < \max_{r' \in BufExt} coverage_s(r')$
then replace a rule with the maximal coverage in $BufExt$ with $r$;

The analogous procedure saveReduced is executed when a rule is reduced:

Algorithm 4 saveReduced($r$)

if the buffer $BufRed$ is not full then add $r$ to $BufRed$;
else if $coverage_s(r) > \min_{r' \in BufRed} coverage_s(r')$
then replace a rule with the minimal coverage in $BufRed$ with $r$;

When the procedure learn needs to compute parameters for a newly generated
or reduced rule, it first checks whether the rule is still available in the
corresponding buffer.
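
Algorithms 3 and 4 can be sketched as two bounded buffers with opposite replacement policies. The sketch below is illustrative only: the list-based buffers, the `capacity` parameter and the `coverage` callable (standing for $coverage_s$) are assumptions made for the example.

```python
def save_extended(buf_ext, r, coverage, capacity):
    """Algorithm 3: keep already-extended rules, preferring low coverage."""
    if len(buf_ext) < capacity:
        buf_ext.append(r)
        return
    highest = max(buf_ext, key=coverage)
    if coverage(r) < coverage(highest):
        # Replace a rule with the maximal coverage by r.
        buf_ext[buf_ext.index(highest)] = r


def save_reduced(buf_red, r, coverage, capacity):
    """Algorithm 4: keep already-reduced rules, preferring high coverage."""
    if len(buf_red) < capacity:
        buf_red.append(r)
        return
    lowest = min(buf_red, key=coverage)
    if coverage(r) > coverage(lowest):
        # Replace a rule with the minimal coverage by r.
        buf_red[buf_red.index(lowest)] = r
```

A linear scan is enough for small buffers; for larger ones a heap keyed on coverage would give the same policy with cheaper replacement.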

Another solution that reduces "shimmering" is grouping examples. Instead
of learning each new example separately, the algorithm first collects a large
group of examples and then starts learning rules. The learning process for a
group of examples may last much longer than for a single example. However,
notice that the procedure learn may be easily split into two parts: the first
one corrects the contents of the maintained sets and the parameters of rules
according to the sample including the new group, and the second one generates
new rules. The first part is always short, hence the second one is critical for
time performance. Therefore a good assumption for the second part is to be ready
to stop learning and classify a new object with the current set of rules
whenever the classification procedure is called. This requires the algorithm to
use rules with confidence less than 1 for classification. In this approach the
strategy of choosing rules for extension is important. The higher the confidence
a rule retains after updating by a new group of examples, the more reliable it
is for the classification procedure. Therefore a good strategy is to start
extending with the rule having confidence nearest to 1 and move towards rules
with lower confidence. In this way more reliable rules are adapted to a new
group first. The latter solution also provides a good basis for distributed
computation.
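
The extension strategy described above (most confident rules first, ready to stop at any time) might look as follows. This is only a sketch: `confidence`, `extend` and `stop_requested` are hypothetical callables standing for the rule confidence, the rule-extension step, and a check for a pending classification call.

```python
def extend_by_confidence(candidates, confidence, extend, stop_requested):
    """Extend candidates in order of decreasing confidence until interrupted."""
    processed = []
    for rule in sorted(candidates, key=confidence, reverse=True):
        if stop_requested():  # e.g. the classification procedure was called
            break
        extend(rule)
        processed.append(rule)
    return processed
```

Rules left unprocessed when the loop stops keep a confidence below 1, which is why the classifier must accept such rules.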

The presented algorithm may also be adapted to the case when it is given a
very large set of examples $s$ at once. As in the incremental algorithm, it
executes the procedure learn for successive examples in $s$. Because of the
size, the computation for the whole sample would take very long and would block
classification procedure calls. To avoid this, the learning procedure is always
stopped when the classification procedure is called and waits until the
classification is completed. Classification uses the current set of computed
rules. Many of them may still be inconsistent with a number of examples;
therefore before classification the algorithm needs to calculate the
qualitative parameters of rules, confidence and coverage, with respect to the
whole sample $s$. This imposes the additional condition that the classifier
used accepts rules with confidence less than 1.

Computing parameters for a set of rules consumes much less time than
generating this set, but computing them every time the classification procedure
is called is usually still too expensive for a large set of rules and a large
set of objects. To avoid this problem the algorithm may perform the following
operations. For a particular object to classify, it may compute parameters only
for rules covering the object. Once computed, parameters for a rule may be
preserved as long as the rule is held in the corresponding union
$Rules(d) \cup Cand(d)$. Independently of classification procedure calls, the
learning procedure may stop at regular intervals and compute parameters for
rules generated since the previous stop. The choice of appropriate data
structures may significantly accelerate computation of parameters for rules and
objects.
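
The idea of computing parameters only for rules covering the object, and preserving them afterwards, can be sketched with a simple cache. The encoding of a rule as `(premise, decision)` with a tuple of `(attribute, value)` pairs is hypothetical, and the definitions used here (confidence as the fraction of matched examples with the rule's decision, coverage as the fraction of examples of that decision matched by the rule) are assumptions, since the paper's definitions are not given in this section.

```python
def matches(premise, obj):
    """True if the object satisfies every (attribute, value) literal."""
    return all(obj.get(attr) == value for attr, value in premise)


def rule_params(rule, sample):
    """Return (confidence, coverage) of a rule on a sample of (object, decision)."""
    premise, decision = rule
    matched = [d for obj, d in sample if matches(premise, obj)]
    support = sum(1 for d in matched if d == decision)
    in_class = sum(1 for _, d in sample if d == decision)
    confidence = support / len(matched) if matched else 0.0
    coverage = support / in_class if in_class else 0.0
    return confidence, coverage


def params_for_object(obj, rules, sample, cache):
    """Compute parameters lazily, only for rules that cover the object."""
    result = {}
    for rule in rules:
        if matches(rule[0], obj):
            if rule not in cache:          # reuse parameters computed earlier
                cache[rule] = rule_params(rule, sample)
            result[rule] = cache[rule]
    return result
```

In an incremental setting the cache entry for a rule would have to be dropped whenever the sample changes or the rule leaves the maintained sets.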

If all methods of improving efficiency fail, the exhaustive search
may be replaced directly by any heuristic search.



5 Illustrative Example 

We present experimental results for the data set Income (13 attributes, 30162
training cases, 15060 testing cases) from the repository at University of
California, Irvine (http://kdd.ics.uci.edu). In preprocessing, discretization
was used and 32 binary attributes were chosen by a greedy heuristic algorithm
optimizing discernibility.

The learning procedure was executed for groups of examples. We used the 
incremental constraint based algorithm with the modification that rules were 
extended not in all possible directions but only with rules that have the best 
confidence and the best coverage if there are ties in confidence for at least one 
covered object. 

For each coverage threshold 0, 0.05 and 0.2 we performed a series of
computations in the following way. First the procedure learn was executed for
the first 1/8 of the training set and the testing set was classified. In each
subsequent step a number of examples equal to the number received in all
previous steps was added and learned, and the next test on the testing set was
performed. In this way the size of successive groups of examples grew
exponentially. Each test object was classified with the decision of the best
covering rule in the union $\bigcup_{d} (Rules(d) \cup nonCCand(d))$, taken over
all decision values $d$, according to confidence and, in case of ties,
to coverage.
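
The doubling schedule and the choice of the best covering rule can be sketched as below. The rounding of group boundaries by integer division and the rule encoding (`(premise, decision)` mapped to `(confidence, coverage)`) are assumptions for illustration; the paper does not state how fractions of the training set were rounded.

```python
def doubling_groups(n_examples, parts=8):
    """Sizes of successive groups: first 1/parts of the data, then doubling."""
    bounds = []
    k = parts
    while k >= 1:
        bounds.append(n_examples // k)  # cumulative sizes n/8, n/4, n/2, n
        k //= 2
    return [bounds[0]] + [b - a for a, b in zip(bounds, bounds[1:])]


def matches(premise, obj):
    """True if the object satisfies every (attribute, value) literal."""
    return all(obj.get(attr) == value for attr, value in premise)


def classify(obj, rules_with_params):
    """Decision of the covering rule with the best confidence, then coverage."""
    covering = [
        (conf, cov, decision)
        for (premise, decision), (conf, cov) in rules_with_params.items()
        if matches(premise, obj)
    ]
    if not covering:
        return None  # no rule covers the object
    return max(covering, key=lambda t: (t[0], t[1]))[2]
```

For the 30162 training cases of Income this schedule yields four groups, so the test set is classified four times in each series.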

The results are presented on the graphs below. Left side graphs present the
classification error, time and number of rules obtained in three series of
incremental learning with different constraints: $coverage > 0.2$ (light line
with boxes), $coverage > 0.05$ (medium dark line with circles) and
$coverage > 0$ (dark line with diamonds). Right side graphs present the final
results of incremental (medium dark line with crosses) and nonincremental (dark
line with circles) learning for different coverage based constraints.

[Graphs: Income - classification error (%), left and right panels; Income - time (sec)]

In the presented example the application of stronger constraints brought a
significant reduction of used memory and time and a very small deterioration of
accuracy, or even an improvement for low coverage thresholds. The results also
show that the accuracy obtained with a small part of the training set used in
the learning process is not significantly lower than for the whole training set
and, finally, that incremental learning reached better results than
nonincremental learning. Similar properties of test results on other data sets
(Shuttle, Letter) indicate that the combination of the incremental approach and
a coverage based constraint may be an effective tool for learning concepts from
both dynamic and large data sets.

6 Conclusions 

We have shown how rough set methods can be adapted to dynamically changing
data. We proposed a method based on a special type of monotonic constraints
that allowed us to reduce searching in the space of rules without substantial
changes in the classification quality. The presented method may be adapted to
large data sets, especially when it is implemented on a cluster of computers.
The experimental example indicates that the incremental approach may preserve
all advantages of nonincremental methods and add new ones, such as reduction in
used time and memory and continuous improvement.

The following related problems are subjects for future study: methods
for coding arbitrary attributes by binary ones, e.g. by discretization or value
grouping, and efficient methods for computing confidence and coverage for large
rule sets, because this is the most time consuming operation.

Abstract. This paper presents a hybrid model for rule discovery in real
world data with uncertainty and incompleteness. The hybrid model is
created by introducing an appropriate relationship between deductive
reasoning and stochastic process, and extending the relationship so as
to include abduction. Furthermore, a Generalization Distribution Table
(GDT), which is a variant of transition matrix in stochastic process, is
defined. Thus, the typical methods of symbolic reasoning such as deduction,
induction, and abduction, as well as the methods based on soft
computing techniques such as rough sets, fuzzy sets, and granular computing
can be cooperatively used by taking the GDT and/or the transition
matrix in stochastic process as mediums. Ways for implementation
of the hybrid model are also discussed.



1 Introduction 

In order to deal with the complexity of real world, we argue that an ideal rule 
discovery system should have such features as: 

— The use of background knowledge can be selected according to whether back- 
ground knowledge exists or not. 

That is, on the one hand, background knowledge can be used flexibly in 
the discovery process; on the other hand, if no background knowledge is 
available, it can also work. 

— Imperfect data can be handled effectively, and the accuracy affected by im- 
perfect data can be explicitly represented in the strength of the rule. 

— Biases can be flexibly selected and adjusted for constraint and search control. 

— Data change can be processed easily. 

Since the data in most databases are constantly changing (e.g., the data are
often added, deleted, or updated), a good method for real applications has to
handle data change conveniently.

— The discovery process can be performed in a distributed cooperative mode. 

It is clear that no single method can offer all of the above capabilities. A
feasible way is to combine several techniques to construct a hybrid
approach. We argue that the hybrid approach is an important way to deal with
real world problems. Here “hybrid” means combining many advantages of existing
methods while avoiding the disadvantages or weaknesses that appear when the
existing methods are used separately. There are ongoing efforts to integrate
logic (including non-classical logic), artificial neural networks, probabilistic
and statistical reasoning, fuzzy set theory, rough set theory, genetic
algorithms and other methodologies in the soft computing paradigm [1, 14, 9].

In this paper, a hybrid model is proposed for discovering if-then rules in data
in an environment with uncertainty and incompleteness. The core of the hybrid
model is the Generalization Distribution Table (GDT), which is a variant of the
transition matrix in stochastic process. We will also discuss ways of
implementing the hybrid model.

2 A Hybrid Intelligent Model 

2.1 An Overview 

In general, hybrid models involve a variety of different types of processes and 
representations in both learning and performance. Hence, multiple mechanisms 
interact in a complex way in most models. 

Figure 1 shows a hybrid intelligent model for discovering if-then rules in data,
which is created by introducing an appropriate relationship between deductive
reasoning and stochastic process [9], and extending the relationship so as to
include abduction [14]. Then a Generalization Distribution Table (GDT), which
is a variant of transition matrix (TM) in stochastic process, is defined [14, 15].
Thus, the typical methods of symbolic reasoning such as deduction, induction,
and abduction, as well as the methods based on soft computing techniques such
as rough sets, fuzzy sets, and granular computing can be cooperatively used by
taking the GDT and/or the transition matrix in stochastic process as mediums.



[Figure 1 diagram; inference patterns shown: (A, A -> B => B), (B, A -> B => A), (A, B => A -> B)]

TM: Transition Matrix        GDT: Generalization Distribution Table
RS: Rough Sets               NR: Networks Representation
GrC: Granular Computing      ILP: Inductive Logic Programming

Fig. 1. A hybrid model for knowledge discovery



The shaded parts in Figure 1 are the major parts of our current study. The
central idea of our methodology is to use the GDT as a hypothesis search space
for generalization, in which the probabilistic relationships between concepts and
instances over discrete domains are represented. By using the GDT as a
probabilistic search space, (1) unseen instances can be considered in the rule
discovery process and the uncertainty of a rule, including its ability to
predict unseen instances, can be explicitly represented in the strength of the
rule [15]; (2) biases can be flexibly selected for search control and background
knowledge can be used as a bias to control the creation of a GDT and the rule
discovery process [17].

Based on the GDT, we have developed or are developing three hybrid systems.
The first one is called GDT-NR, which is based on a network representation
of the GDT; the second one is called GDT-RS, which is based on the combination
of GDT and Rough Set theory; the third one is called GDT-RS-ILP, which
is based on the combination of GDT, Rough Set theory, and Inductive Logic
Programming, for extending GDT-RS to relation learning. Furthermore, Granular
Computing (GrC) can be used as a preprocessing step to change granules of
individual objects for dealing with continuous values and imperfect data [12, 13].

In Figure 1, we further distinguish two kinds of lines: the solid lines and the 
dotted lines. The relationships denoted by the solid lines will be described in 
this paper, but the ones denoted by the dotted lines will not. 

2.2 Deductive Reasoning vs. Stochastic Process 

Deductive reasoning can be analyzed ultimately into the repeated application of 
the strong syllogism: 

If A is true, then B is true 
A is true 
Hence, B is true. 

That is, $(A \land (A \to B) \Rightarrow B)$ in short, where $A$ and $B$ are
logic formulae. Let us consider two predicates $F$ and $G$, and let $d$ be a
finite set. For simplicity, we assume that $F$ and $G$ are single place
predicates. They give descriptions of an object in $d$. In other words, $F$
(the definition for $G$ is the same as the one for $F$ from now on) classifies
all elements in the set $d$ into two classes:

$\{x \in d \mid F(x) \text{ is true}\}$ (that is, $x$ satisfies $F$), and
$\{x \in d \mid F(x) \text{ is false}\}$ (that is, $x$ does not satisfy $F$).

In the following, $F(x)$ and $\overline{F}(x)$ mean “$F(x)$ is true (or $x$
satisfies $F$)” and “$F(x)$ is false (or $x$ does not satisfy $F$)”
respectively for $x \in d$. Thus, one of the most useful forms of $A \to B$ can
be denoted by multi-layer logic [8, 9] as

$[\forall X/d](F(X) \to G(X))$.

That is, the multi-layer logic formula is read “for any $X$ belonging to $d$,
if $F(X)$ is true then $G(X)$ is also true”. Notice that since the set $d$ can
be looked upon as an ordered set, $F$ is represented by a sequence of $n$ binary
digits. Thus, the $i$th binary digit is 1 or 0 for $F(a_i)$ or
$\overline{F}(a_i)$ corresponding to the $i$th element $a_i$ of the set $d$.
Based on this preparation, several basic concepts are described first for
creating an appropriate relationship between deductive reasoning and
stochastic process.

1. Expansion function.

If the domain set $d$ of formula $F$ is finite, $d = \{a_1, a_2, \ldots, a_n\}$,
then the multi-layer logic formulae $[\forall X/d]F(X)$ and $[\exists X/d]F(X)$
can be expanded as

$F(a_1) \land F(a_2) \land \ldots \land F(a_n)$ (or $\bigwedge_{a_i \in d} F(a_i)$ in short) and
$F(a_1) \lor F(a_2) \lor \ldots \lor F(a_n)$ (or $\bigvee_{a_i \in d} F(a_i)$ in short), respectively.

These are called expansion functions of multi-layer logic formulae.

An expansion function is used to extract from a set the elements that possess
specified properties. Furthermore, the multi-layer logic formulae

$[\forall X/d](F(X) \to G(X))$ and
$[\exists X/d](F(X) \to G(X))$

can be expanded as

$\bigwedge_{a_i \in d} (F(a_i) \to G(a_i))$ and $\bigvee_{a_i \in d} (F(a_i) \to G(a_i))$, respectively.



2. States of $d$ with respect to $F$.

Let $d = \{a_1, a_2, \ldots, a_n\}$ be a finite set. The states of $d$ with
respect to $F$ are defined as the conjunctions of either $F(a_i)$ or
$\overline{F}(a_i)$ for every element $a_i$ in $d$. That is, the states of $d$
corresponding to $F$ are

$S_1(d, F)$: $\overline{F}(a_1) \land \overline{F}(a_2) \land \ldots \land \overline{F}(a_{n-1}) \land \overline{F}(a_n)$,
$S_2(d, F)$: $\overline{F}(a_1) \land \overline{F}(a_2) \land \ldots \land \overline{F}(a_{n-1}) \land F(a_n)$,
..., and
$S_{2^n}(d, F)$: $F(a_1) \land F(a_2) \land \ldots \land F(a_{n-1}) \land F(a_n)$.

Let $Prior\_S(d, F) = \{S_1(d, F), \ldots, S_{2^n}(d, F)\}$. Each $S_i(d, F)$ in
$Prior\_S(d, F)$ is called a prior state of $d$ with respect to $F$.

For example, if $d = \{a_1, a_2, a_3\}$, its possible prior and posterior states are

$Prior\_S(d, F) = \{\overline{F}(a_1) \land \overline{F}(a_2) \land \overline{F}(a_3),\ \overline{F}(a_1) \land \overline{F}(a_2) \land F(a_3),\ \ldots,\ F(a_1) \land F(a_2) \land F(a_3)\}$,
$Posterior\_S([\forall X/d]F(X)) = \{F(a_1) \land F(a_2) \land F(a_3)\}$,
$Posterior\_S([\exists X/d]F(X)) = \{\overline{F}(a_1) \land \overline{F}(a_2) \land F(a_3),\ \ldots,\ F(a_1) \land F(a_2) \land F(a_3)\}$.

Using binary digits 1 and 0 instead of $F(a_i)$ and $\overline{F}(a_i)$, the
above states can be expressed as follows:

$Prior\_S(d, F) = \{000, 001, 010, \ldots, 111\}$,
$Posterior\_S([\forall X/d]F(X)) = \{111\}$,
$Posterior\_S([\exists X/d]F(X)) = \{001, 010, \ldots, 111\}$.
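
The binary encoding of states can be reproduced with a few lines of code. The following sketch is illustrative only; the function names are hypothetical, and a state is represented as a string whose $i$th character is 1 if $F(a_i)$ holds and 0 otherwise.

```python
from itertools import product


def prior_states(n):
    """All 2^n states of a domain with n elements, in the order 00...0 to 11...1."""
    return ["".join(bits) for bits in product("01", repeat=n)]


def posterior_states_forall(n):
    """States compatible with 'F holds for every element of d'."""
    return [s for s in prior_states(n) if all(b == "1" for b in s)]


def posterior_states_exists(n):
    """States compatible with 'F holds for at least one element of d'."""
    return [s for s in prior_states(n) if "1" in s]
```

For $n = 3$ these functions return the eight prior states, the single state 111, and the seven states other than 000, in agreement with the example above.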

3. Probability vector of state occurring.

A probability vector with respect to $[\forall X/d]F(X)$ is

$P_{[\forall X/d]F(X)} = (0, 0, \ldots, 1)$, and

a probability vector with respect to $[\exists X/d]F(X)$ is

$P_{[\exists X/d]F(X)} = (0, a, \ldots, a)$,

where $\sum a = 1$.

Based on the basic concepts stated above, our purpose is to create an equivalent
relationship between deductive reasoning and stochastic process [9]. In other
words, $(F \land (F \to G) \Rightarrow G)$ is explained by an equivalent
stochastic process $P_{[\forall X/d]F(X)} T = P_{[\forall X/d]G(X)}$, where $T$
is a transition matrix that is equivalent to $(F \to G)$. In order to create the
equivalent relationship, the following three conditions must be satisfied for
creating $T$:

- The elements $t_{ij}$ of $T$ are the probabilities $p(S_j(d, G) \mid S_i(d, F))$.

- $p(S_j(d, G) \mid S_i(d, F))$ must satisfy the truth table of implicative relation.

- $\sum_j t_{ij} = 1$.

Here $S_i(d, F)$ denotes the $i$th prior state in $Prior\_S(d, F)$ and
$S_j(d, G)$ denotes the $j$th prior state in $Prior\_S(d, G)$. That is, by using
the expansion function stated above,

$[\forall X/d](F(X) \to G(X))$ (or $[\exists X/d](F(X) \to G(X))$)

is equivalent to

$\bigwedge_{a_i \in d} (F(a_i) \to G(a_i))$ (or $\bigvee_{a_i \in d} (F(a_i) \to G(a_i))$).

According to the truth table of implicative relation, if the value of $F(a_i)$
is known, to satisfy $F(a_i) \to G(a_i)$ it must follow that
$\overline{F}(a_i) \lor G(a_i) = 1$. In other words, the creation of $T$ must
satisfy the condition:

if $F(a_i)$ is true, $G(a_i)$ is true; otherwise any value of $G(a_i)$ is correct.
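
The three conditions can be checked on a small example. The sketch below builds $T$ for the universally quantified case under the assumption that no background knowledge is used, so the admissible states of $G$ for each state of $F$ are taken to be equiprobable; the function names and the dictionary-of-rows representation are hypothetical.

```python
from fractions import Fraction
from itertools import product


def states(n):
    """All 2^n binary states in the order 00...0 to 11...1."""
    return ["".join(bits) for bits in product("01", repeat=n)]


def transition_matrix(n):
    """Rows are states of F, columns states of G; zero entries are omitted."""
    T = {}
    for f in states(n):
        # G is admissible if, for every element, F false or G true.
        allowed = [g for g in states(n)
                   if all(fb == "0" or gb == "1" for fb, gb in zip(f, g))]
        T[f] = {g: Fraction(1, len(allowed)) for g in allowed}
    return T


def apply(p, T):
    """Multiply a probability vector over F-states by T."""
    out = {}
    for f, pf in p.items():
        for g, t in T[f].items():
            out[g] = out.get(g, 0) + pf * t
    return out
```

With $n = 3$, the row for the $F$-state 010 contains the four $G$-states 010, 011, 110 and 111 with probability 1/4 each, and the vector concentrated on 111 is mapped to the vector concentrated on 111, as required by $P_{[\forall X/d]F(X)} T = P_{[\forall X/d]G(X)}$.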

Table 1 shows an example of a transition matrix corresponding to
$[\forall X/d](F(X) \to G(X))$ and $d = \{a_1, a_2, a_3\}$. In Table 1, the
states in the left column denote respectively

000: $\overline{F}(a_1) \land \overline{F}(a_2) \land \overline{F}(a_3)$
001: $\overline{F}(a_1) \land \overline{F}(a_2) \land F(a_3)$
...
111: $F(a_1) \land F(a_2) \land F(a_3)$,

and the states in the top row denote respectively

000: $\overline{G}(a_1) \land \overline{G}(a_2) \land \overline{G}(a_3)$
001: $\overline{G}(a_1) \land \overline{G}(a_2) \land G(a_3)$
...
111: $G(a_1) \land G(a_2) \land G(a_3)$,

and the elements $t_{ij}$ of $T$ shown in the transition matrix are the
probability distribution corresponding to $[\forall X/d](F(X) \to G(X))$; the
elements of $T$ not displayed are all zero. Furthermore, since no background
knowledge is used to create the probability distribution shown in Table 1, the
states are equiprobable. For example, if the state of $F$ is 010, then to
satisfy $F(X) \to G(X)$ the possible states of $G$ are {010, 011, 110, 111}, and
the probability of each of them is 1/4.

Table 1. A transition matrix equivalent to $[\forall X/d](F(X) \to G(X))$

2.3 Hypothesis Generation Based on the Transition Matrix

Our purpose is to create a hybrid model, as shown in Figure 1, for rule discovery
in data with uncertainty and incompleteness. For this purpose, we would like to
discuss here a kind of weaker reasoning, the so-called weaker syllogism:

If A is true, then B is true
B is true
Hence, A becomes more plausible.

That is, $(B \land (A \to B) \Rightarrow A)$ in short. The evidence does not
prove that $A$ is true, but verification of one of its consequences does give us
more confidence in $A$. This is a kind of plausible reasoning for hypothesis
generation, which is called “abduction”. In other words, from the observed fact
$B$ and the known rule $A \to B$, $A$ can be guessed. That is, according to the
transition matrix shown in Table 1, from each element $x \in d$ such that $G(x)$
is true and the rule $[\forall X/d](F(X) \to G(X))$, $F(X)$ can be guessed.
Thus, an appropriate relationship between deductive reasoning and abductive
reasoning is created by using the transition matrix as a medium, as shown in
Figure 1.

2.4 Generalization Distribution Table (GDT) 

The central idea of our methodology is to use a variant of the transition matrix,
called a Generalization Distribution Table (GDT), as a hypothesis search space
for generalization, in which the probabilistic relationships between concepts and
instances over discrete domains are represented [14]. Thus, the representation
of the original transition matrix introduced in Section 2.2 must be modified
appropriately and some concepts must be described for our purpose.

A GDT is defined as consisting of three components. The first one is possible
instances, which are all possible combinations of attribute values in a database.
They are denoted in the top row of a GDT. The second one is possible
generalizations for instances, which are all possible cases of generalization
for all possible instances. They are denoted in the left column of a GDT. “$*$”,
which specifies a wild card, denotes the generalization for instances. For
example, the generalization $*b_0c_0$ means the attribute $a$ is unimportant for
describing a concept.

The third component of the GDT is probabilistic relationships between the
possible instances and the possible generalizations, which are represented in the
elements $G_{ij}$ of a GDT. They are the probabilistic distribution for
describing the strength of the relationship between every possible instance and
every possible generalization. The prior distribution is equiprobable if no
prior background knowledge is used. Thus, it is defined by Eq. (1), and
$\sum_j G_{ij} = 1$:

\[
G_{ij} = p(PI_j \mid PG_i) =
\begin{cases}
\dfrac{1}{N_{PG_i}} & \text{if } PI_j \in PG_i, \\
0 & \text{otherwise,}
\end{cases}
\qquad (1)
\]

where $PI_j$ is the $j$th possible instance, $PG_i$ is the $i$th possible
generalization, and $N_{PG_i}$ is the number of the possible instances
satisfying the $i$th possible generalization, that is,

\[
N_{PG_i} = \prod_{k \in \{l \mid PG_i[l] = *\}} n_k, \qquad (2)
\]

where $PG_i[l]$ is the value of the $l$th attribute in the possible
generalization $PG_i$, $PG_i[l] = *$ means that $PG_i$ doesn't contain attribute
$l$, and $n_k$ is the number of values of the $k$th attribute.
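
Eqs. (1) and (2) can be illustrated with a small prior GDT. In the sketch below the function name, the tuple encoding of instances and generalizations, and the choice to list only generalizations with at least one wild card and at least one fixed value are assumptions made for the example; zero entries of the table are simply omitted.

```python
from fractions import Fraction
from itertools import product
from math import prod

WILDCARD = "*"


def build_gdt(domains):
    """domains: one list of values per condition attribute."""
    instances = list(product(*domains))
    generalizations = [
        g for g in product(*[list(values) + [WILDCARD] for values in domains])
        if WILDCARD in g and any(v != WILDCARD for v in g)
    ]
    table = {}
    for g in generalizations:
        # Eq. (2): product of the domain sizes of the wild-carded attributes.
        n_pg = prod(len(values) for values, v in zip(domains, g) if v == WILDCARD)
        # Eq. (1): equal probability for every instance the generalization covers.
        table[g] = {
            inst: Fraction(1, n_pg)
            for inst in instances
            if all(gv == WILDCARD or gv == iv for gv, iv in zip(g, inst))
        }
    return instances, generalizations, table
```

For three attributes with 2, 3 and 2 values, the generalization $*b_0c_0$ covers two instances with probability 1/2 each, and every row of the table sums to 1.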

Thus, in our approach, the basic process of hypothesis generation is to
generalize the instances observed in a database by searching and revising the
GDT. Here, two kinds of attributes need to be distinguished: condition
attributes and decision attributes (sometimes called class attributes) in a
database. Condition attributes as possible instances are used to create the GDT,
but the decision attributes are not. The decision attributes are normally used
to decide which concept (class) should be described in a rule. Usually a single
decision attribute is all that is required.



3 Ways of Implementation 

We have tried several ways for implementing some aspects of the hybrid model 
stated in the above section. One possible way is to use the transition matrix 
in stochastic process as a medium for implementing a hybrid system [9]. Let us 
assume that some causal relation seems to exist between observations. Let the 
observations be classified into finite classes and represented by a state set. The 
scheme of transition process stated in Section 2.2 is used as the framework to 
represent the causal relation. Through learning in this framework a tendency 
of the transition between input and output is learned. If the transition matrix 
reveals the complete or approximate equivalence with logical inference, the log- 
ical expression can be discovered. Furthermore, the transition process can be 
represented by connectionist network for solving the space complexity. 

Another way is to use the GDT as a medium for generalization and dealing
with uncertainty and incompleteness. A GDT can be represented by a variant
of connectionist networks (GDT-NR for short), and rules can be discovered by
learning on the network representation of the GDT [16]. Furthermore, the GDT
is combined with the rough set methodology (GDT-RS for short) [15]. By using
GDT-RS, a minimal set of rules with larger strengths can be acquired from
databases with noisy, incomplete data. The strength of a rule represents the
uncertainty of the rule, which is influenced by both unseen instances and noise.
Two algorithms have been developed for implementing the GDT-RS [2].

GDT-NR and GDT-RS stated above are two hybrid systems belonging to
attribute-value learning, which has been the mainstream in the inductive
learning and data mining communities to date. Another type of inductive learning
is relation learning, also called Inductive Logic Programming (ILP) [3, 6].

ILP is a relatively new method in machine learning. ILP is concerned with
learning from examples within the framework of predicate logic. ILP is relevant
to data mining, and compared with the attribute-value learning methods, it
possesses the following advantages:

- ILP can learn knowledge which is more expressive than that by the
attribute-value learning methods, because the former is in predicate logic
while the latter is usually in propositional logic.

- ILP can utilize background knowledge more naturally and effectively,
because in ILP the examples, the background knowledge, as well as the learned
knowledge are all expressed within the same logic framework.

However, when applying ILP to large real-world applications, we can identify
some weak points compared with the attribute-value learning methods, such as:

- It is more difficult to handle numbers (especially continuous values)
prevailing in real-world databases, because predicate logic lacks effective
means for this.

- The theory, techniques and experiences are much less mature for ILP to deal
with imperfect data (uncertainty, incompleteness, vagueness, impreciseness,
etc. in examples, background knowledge as well as the learned rules) than in
the traditional attribute-value learning methods (see [3, 11], for instance).

The discretization of continuous valued attributes as a pre-processing step is
a solution for the first problem mentioned above [7]. Another way is to
use Constraint Inductive Logic Programming (CILP), an integration of ILP and
CLP (Constraint Logic Programming) [4].

For the second problem, a solution is to combine GDT (also GDT-RS) with
ILP, that is, GDT-ILP and GDT-RS-ILP, to deal with some kinds of imperfect
data which occur in large real-world applications [5]. The GDT-RS-ILP system
is currently under development.

4 Conclusion 

In this paper, a hybrid model for rule discovery in real world data with
uncertainty and incompleteness was presented. The central idea of our
methodology is to use the GDT as a hypothesis search space for generalization,
in which the probabilistic relationships between concepts and instances over
discrete domains are represented. By using the GDT as a probabilistic search
space, (1) unseen instances can be considered in the rule discovery process and
the uncertainty of a rule, including its ability to predict unseen instances,
can be explicitly represented in the strength of the rule; (2) biases can be
flexibly selected for search control, and background knowledge can be used as a
bias to control the creation of a GDT and the rule discovery process. Several
hybrid discovery systems based on the hybrid model have been developed or are
under development.

The ultimate aim of the research project is to create an agent-oriented and
knowledge-oriented intelligent model and system for knowledge discovery
and data mining in an evolutionary, distributed cooperative mode. That is, the
work that we are doing takes but one step toward this model and system.

Abstract. In spatial reasoning the qualitative description of relations between
spatial regions is of practical importance and has been widely studied. Examples
of such relations are that two regions may meet only at their boundaries or that
one region is a proper part of another. This paper shows how systems of relations
between regions can be extended from precisely known regions to approximate
ones. One way of approximating regions with respect to a partition of the plane
is that provided by rough set theory for approximating subsets of a set. Relations
between regions approximated in this way can be described by an extension of
the RCC5 system of relations for precise regions. Two techniques for extending
RCC5 are presented, and the equivalence between them is proved. A more
elaborate approximation technique for regions (boundary sensitive
approximation) takes account of some of the topological structure of regions.
Using this technique, an extension to the RCC8 system of spatial relations is
presented.



Keywords: qualitative spatial reasoning, approximate regions. 

1 Introduction 

Rough set theory [Paw91] provides a way of approximating subsets of a set when the
set is equipped with a partition or equivalence relation. Given a set $X$ with a partition
$\{a_i \mid i \in I\}$, an arbitrary subset $b \subseteq X$ can be approximated by a function
$\varphi_b : I \to \{fo, po, no\}$. The value of $\varphi_b(i)$ is defined to be fo if
$a_i \subseteq b$, it is no if $a_i \cap b = \emptyset$, and otherwise the value is po. The
three values fo, po, and no stand respectively for ‘full overlap’, ‘partial overlap’ and
‘no overlap’; they measure the extent to which $b$ overlaps the elements of the partition
of $X$.
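
The three-valued approximation described above can be illustrated with a minimal Python sketch. It assumes finite sets stand in for the set $X$ and its partition; the function name `approximate` and the sample data are illustrative choices, not part of the paper.

```python
from typing import Dict, Hashable, Set


def approximate(partition: Dict[Hashable, Set], b: Set) -> Dict[Hashable, str]:
    """Map each partition index i to 'fo', 'po' or 'no' for the subset b.

    Assumes every partition element is a non-empty set.
    """
    result = {}
    for i, cell in partition.items():
        if cell <= b:                # a_i is contained in b: full overlap
            result[i] = "fo"
        elif cell.isdisjoint(b):     # a_i and b share no element: no overlap
            result[i] = "no"
        else:                        # anything in between: partial overlap
            result[i] = "po"
    return result


# Hypothetical example: X = {0, ..., 5} split into three cells.
partition = {1: {0, 1}, 2: {2, 3}, 3: {4, 5}}
phi = approximate(partition, {0, 1, 2})
print(phi)
```

Here the subset {0, 1, 2} fully covers the first cell, partially covers the second, and misses the third.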

In spatial reasoning it is often necessary to approximate not subsets of an arbitrary 
set, but parts of a set with topological or geometric structure. For example the set X 
above might be replaced by a regular closed subset of the plane, and we might want 
to approximate regular closed subsets of $X$. This approximation might be with respect
to a partition of X where the cells (elements of the partition) might overlap on their 
boundaries, but not their interiors. Because of the additional topological structure, it is 
possible to make a more detailed classification of overlaps between subsets and cells in 
the partition. An account of how this can be done was given in our earlier paper [BS98]. 
This is, however, only one of the directions in which the basic rough sets approach to 
approximation can be generalized to spatial approximation. 



W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 445-453, 2001. 
Our concern in the present paper is relationships between spatial regions when these
regions have been given approximate descriptions. The study of relationships between 
spatial regions is of practical importance in Geographic Information Systems (GIS), 
and has resulted in many papers [EF91,RCC92,SP92]. Examples of relationships might 
be that two regions meet only at their boundaries or that one region is a proper part of 
another. While such relationships have been widely studied, the topic of relationships 
between approximate regions has received little attention. 

The structure of the paper is as follows. In section 2 we set out the particular type of 
approximate regions we use in the main part of the paper. In section 3 we discuss one 
particular scheme of relationships between regions, known as the RCC5, and in section 
4 we show how this can be generalized to approximate regions. In section 5 we briefly 
consider how our work can be extended to deal with more detailed boundary-sensitive 
approximations and the RCC8 system of relationships between regions. Finally in sec- 
tion 6 we present conclusions and suggest directions for further work. 

2 Approximating Regions 

Spatial regions can be described by specifying how they relate to a frame of reference. 
In the case of two-dimensional regions, the frame of reference could be a partition of 
the plane into cells which may share boundaries but which do not overlap. A region can 
then be described by giving the relationship between the region and each cell. 

2.1 Boundary Insensitive Approximation 

Approximation functions. Suppose a space $R$ of detailed or precise regions. By imposing
a partition, $G$, on $R$ we can approximate elements of $R$ by elements of $\Omega_3^G$. That is,
we approximate regions in $R$ by functions from $G$ to the set $\Omega_3 = \{fo, po, no\}$. The
function which assigns to each region $r \in R$ its approximation will be denoted
$\alpha_3 : R \to \Omega_3^G$. The value of $(\alpha_3\, r)\, g$ is fo if $r$ covers all of the cell $g$,
it is po if $r$ covers some but not all of the interior of $g$, and it is no if there is no
overlap between $r$ and $g$. We call the elements of $\Omega_3^G$ the boundary insensitive
approximations of regions $r \in R$ with respect to the underlying regional partition $G$.

Each approximate region $X \in \Omega_3^G$ stands for a set of precise regions, i.e. all those
precise regions having the approximation $X$. This set, which will be denoted $[\![X]\!]$,
provides a semantics for approximate regions: $[\![X]\!] = \{r \in R \mid \alpha_3\, r = X\}$.

Operations on approximation functions. The domain of regions is equipped with a
meet operation interpreted as the intersection of regions. In the domain of approximation
functions the meet operation between regions is approximated by pairs of greatest
minimal, $\wedge_{min}$, and least maximal, $\wedge_{max}$, meet operations on approximation mappings
[BS98].

Consider the operations $\wedge_{min}$ and $\wedge_{max}$ on the set $\Omega_3 = \{fo, po, no\}$ that are
defined as follows.

    $\wedge_{min}$ | no  po  fo        $\wedge_{max}$ | no  po  fo
    ---------------+------------       ---------------+------------
    no             | no  no  no        no             | no  no  no
    po             | no  no  po        po             | no  po  po
    fo             | no  po  fo        fo             | no  po  fo
These operations extend to elements of $\Omega_3^G$ (i.e. the set of functions from $G$ to
$\Omega_3$) by

    $(X \wedge_{min} Y)\, g = (X\, g) \wedge_{min} (Y\, g)$

and similarly for $\wedge_{max}$. This definition of the operations on $\Omega_3^G$ is equivalent to the
construction for operations given by Bittner and Stell [BS98, page 108].
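
As a minimal sketch of the two meet operations and their cell-wise extension, the tables can be encoded as nested dictionaries in Python. Approximate regions are represented here as dictionaries from cell names to values; the names `MEET_MIN`, `MEET_MAX`, `meet` and the cell labels are illustrative assumptions.

```python
# Tables for the greatest minimal and least maximal meet on {no, po, fo},
# indexed as TABLE[row][column].
MEET_MIN = {
    "no": {"no": "no", "po": "no", "fo": "no"},
    "po": {"no": "no", "po": "no", "fo": "po"},
    "fo": {"no": "no", "po": "po", "fo": "fo"},
}
MEET_MAX = {
    "no": {"no": "no", "po": "no", "fo": "no"},
    "po": {"no": "no", "po": "po", "fo": "po"},
    "fo": {"no": "no", "po": "po", "fo": "fo"},
}


def meet(table, X, Y):
    """Extend a table on single values to approximate regions, cell by cell."""
    return {g: table[X[g]][Y[g]] for g in X}


# Hypothetical approximate regions over three cells g1, g2, g3.
X = {"g1": "fo", "g2": "po", "g3": "no"}
Y = {"g1": "po", "g2": "po", "g3": "fo"}
lower = meet(MEET_MIN, X, Y)
upper = meet(MEET_MAX, X, Y)
print(lower, upper)
```

The only entry where the two tables differ is po with po: two regions that each partially overlap a cell may or may not intersect inside it, so the minimal meet yields no and the maximal meet yields po.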



2.2 Boundary Sensitive Approximation 



We can further refine the approximation of regions $R$ with respect to the partition
$G$ by taking boundary segments shared by neighboring partition cells into account.
That is, we approximate regions in $R$ by functions from $G \times G$ to the set
$\Omega_5 = \{fo, fbo, pbo, nbo, no\}$. The function which assigns to each region $r \in R$ its
boundary sensitive approximation will be denoted $\alpha_5 : R \to \Omega_5^{G \times G}$. The value of
$(\alpha_5\, r)(g_i, g_j)$ is fo if $r$ covers all of the cell $g_i$, it is fbo if $r$ covers all of the
boundary segment, $(g_i, g_j)$, shared by the cells $g_i$ and $g_j$ and some but not all of the
interior of $g_i$, it is pbo if $r$ covers some but not all of the boundary segment $(g_i, g_j)$
and some but not all of the interior of $g_i$, it is nbo if $r$ does not intersect the boundary
segment $(g_i, g_j)$ but covers some but not all of the interior of $g_i$, and it is no if there
is no overlap between $r$ and $g_i$.

Let $bs = (g_i \cap g_j)$ be the boundary segment shared by the cells $g_i$ and $g_j$. We define
boundary sensitive approximation, $\alpha_5$, in terms of pairs of approximation functions, $\alpha_3$,
as follows [BS98] (left table below):



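Left table:

    $(\alpha_5\, r)(g_i, g_j)$    | $(\alpha_3\, r)\, bs = fo$ | $(\alpha_3\, r)\, bs = po$ | $(\alpha_3\, r)\, bs = no$
    ------------------------------+----------------------------+----------------------------+---------------------------
    $(\alpha_3\, r)\, g_i = fo$   | fo                         | -                          | -
    $(\alpha_3\, r)\, g_i = po$   | fbo                        | pbo                        | nbo
    $(\alpha_3\, r)\, g_i = no$   | -                          | -                          | no

Right table:

    $\wedge_{max}$ | no   nbo  pbo  fbo  fo
    ---------------+------------------------
    no             | no   no   no   no   no
    nbo            | no   nbo  nbo  nbo  nbo
    pbo            | no   nbo  pbo  pbo  pbo
    fbo            | no   nbo  pbo  fbo  fbo
    fo             | no   nbo  pbo  fbo  fo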



Each approximate region $X \in \Omega_5^{G \times G}$ stands for a set of precise regions, i.e. all
those precise regions having the approximation $X$: $[\![X]\!] = \{r \in R \mid \alpha_5\, r = X\}$.

We define the operation $\wedge_{max}$ on the set $\Omega_5 = \{fo, fbo, pbo, nbo, no\}$ as shown in
the right table above. This operation extends to elements of $\Omega_5^{G \times G}$ (i.e. the set of
functions from $G \times G$ to $\Omega_5$) by
$(X \wedge_{max} Y)(g_i, g_j) = (X(g_i, g_j)) \wedge_{max} (Y(g_i, g_j))$. The definition of the
operation $\wedge_{min}$ is similar but slightly more complicated. The details
can be found in [BS98].
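
The boundary sensitive values and the $\wedge_{max}$ table can be sketched in Python as follows. The sketch rests on one observation about the table, namely that $\wedge_{max}$ coincides with the minimum in the linear order no < nbo < pbo < fbo < fo; the function names `alpha5_value` and `meet_max5` are illustrative assumptions.

```python
# Values of the boundary sensitive approximation, ordered from least to most overlap.
OMEGA5 = ["no", "nbo", "pbo", "fbo", "fo"]
RANK = {value: index for index, value in enumerate(OMEGA5)}


def alpha5_value(cell_value: str, segment_value: str) -> str:
    """Combine the boundary insensitive values for a cell g_i and for the
    boundary segment bs into one boundary sensitive value."""
    if cell_value == "fo":
        if segment_value != "fo":
            raise ValueError("a fully covered cell has a fully covered boundary segment")
        return "fo"
    if cell_value == "no":
        if segment_value != "no":
            raise ValueError("a cell without overlap has no covered boundary segment")
        return "no"
    # cell_value == "po": the segment value decides between fbo, pbo and nbo
    return {"fo": "fbo", "po": "pbo", "no": "nbo"}[segment_value]


def meet_max5(a: str, b: str) -> str:
    """Least maximal meet on the five values, read off as the smaller of the two."""
    return OMEGA5[min(RANK[a], RANK[b])]


print(alpha5_value("po", "fo"), meet_max5("pbo", "fbo"))
```

The combinations marked "-" in the definition table are rejected with a `ValueError` rather than silently mapped to a value.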



3 RCC5 Relations 

Qualitative spatial reasoning (QSR) is a well-established subfield of AI. It is concerned
with the representation and processing of knowledge of space and activities which de- 
pend on space. However, the representations used for this are qualitative, rather than the 
quantitative ones of conventional coordinate geometry. One of the most widely studied 
formal approaches to QSR is the Region-Connection Calculus (RCC) [CBGG97]. This 
system provides an axiomatization of space in which regions themselves are primitives, 
rather than being constructed from more primitive sets of points. An important aspect 
of the body of work on RCC is the treatment of relations between regions. For example
two regions could be overlapping, or perhaps only touch at their boundaries. There
are two principal schemes of relations between RCC regions: five boundary insensitive 
relations known as RCC5, and eight boundary sensitive relations known as RCC8. 

In this paper we propose a specific style of defining RCC relations. This style allows
us to define RCC relations exclusively in terms of constraints on the outcome of the
meet operation between (one and two dimensional) regions. Furthermore this style of
definition allows us to obtain a partial ordering with minimal and maximal elements on
the relations defined. Both aspects are critical for the generalization of these relations
to the approximation case.

Given two regions $x$ and $y$ the RCC5 relation between them can be determined by
considering the triple of boolean values:

    $(x \wedge y \neq \bot,\; x \wedge y = x,\; x \wedge y = y)$.



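The correspondence between such triples, the RCC5 classification, and possible
geometric interpretations are given below.

    $x \wedge y \neq \bot$ | $x \wedge y = x$ | $x \wedge y = y$ | RCC5
    -----------------------+------------------+------------------+------
    F                      | F                | F                | DR
    T                      | F                | F                | PO
    T                      | T                | F                | PP
    T                      | F                | T                | PPi
    T                      | T                | T                | EQ

[Figure: geometric interpretations of DR(x,y), PO(x,y), PP(x,y), PPi(x,y) and EQ(x,y),
with arrows indicating the ordering of the relations.]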



The set of triples is partially ordered by setting $(a_1, a_2, a_3) \le (b_1, b_2, b_3)$ iff
$a_i \le b_i$ for $i = 1, 2, 3$, where the Boolean values are ordered by F < T. This is the
same ordering induced by the RCC5 conceptual graph [GC94]. But note that the conceptual
graph has PO and EQ as neighbors which is not the case in the Hasse diagram for
the partially ordered set. The ordering is indicated by the arrows in the figure above.
We refer to this as the RCC5 lattice to distinguish it from the conceptual neighborhood
graph.
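
The triple-based definition and the component-wise ordering can be tried out with a small Python sketch. It assumes that non-empty finite sets stand in for regions and set intersection for the meet, and that the triples (F,F,F), (T,F,F), (T,T,F), (T,F,T), (T,T,T) correspond to DR, PO, PP, PPi, EQ; the helper names are illustrative.

```python
# Triple (x∧y ≠ ⊥, x∧y = x, x∧y = y) for each RCC5 relation.
RCC5_BY_TRIPLE = {
    (False, False, False): "DR",
    (True, False, False): "PO",
    (True, True, False): "PP",
    (True, False, True): "PPi",
    (True, True, True): "EQ",
}
TRIPLE_BY_NAME = {name: triple for triple, name in RCC5_BY_TRIPLE.items()}


def rcc5(x: set, y: set) -> str:
    """RCC5 relation between two non-empty sets, with intersection as the meet."""
    meet = x & y
    return RCC5_BY_TRIPLE[(bool(meet), meet == x, meet == y)]


def leq(r: str, s: str) -> bool:
    """Component-wise ordering of the triples, with F < T (False <= True in Python)."""
    return all(a <= b for a, b in zip(TRIPLE_BY_NAME[r], TRIPLE_BY_NAME[s]))


print(rcc5({1, 2}, {2, 3}), rcc5({1}, {1, 2}), leq("PO", "PP"), leq("PP", "PPi"))
```

In this ordering PP and PPi are incomparable, and PO is not directly below EQ, which matches the remark that the Hasse diagram differs from the conceptual neighborhood graph.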



4 Semantic and Syntactic Generalizations of RCC5 

The original formulation of RCC dealt with ideal regions which did not suffer from 
imperfections such as vagueness, indeterminacy or limited resolution. However, these 
are factors which affect spatial data in practical examples, and which are significant in 
applications such as geographic information systems (GIS) [BF95]. The issue of vagueness
and indeterminacy has been tackled in the work of [CG96]. The topic of the present
paper is not vagueness or indeterminacy in the widest sense, but rather the special case 
where spatial data is approximated by being given a limited resolution description. 

4.1 Syntactic and Semantic Generalizations 

There are two approaches we can take to generalizing the RCC5 classification from 
precise regions to approximate ones. These two may be called the semantic and the 
syntactic. The syntactic has many variants. 
Semantic generalization. We can define the RCC5 relationship between approximate
regions $X$ and $Y$ to be the set of relationships which occur between any pair of
precise regions representing $X$ and $Y$. That is, we can define

    $SEM(X, Y) = \{RCC5(x, y) \mid x \in [\![X]\!] \text{ and } y \in [\![Y]\!]\}$.

Syntactic generalization. We can take a formal definition of RCC5 in the precise
case which uses operations on $R$ and generalize this to work with approximate regions
by replacing the operations on $R$ by analogous ones for $\Omega_3^G$.

4.2 Syntactic Generalization 

The above formulation of the RCC5 relations can be extended to approximate regions.
One way to do this is to replace the operation $\wedge$ with an appropriate operation for
approximate regions. If $X$ and $Y$ are approximate regions (i.e. functions from $G$ to $\Omega_3$)
we can consider the two triples of Boolean values:

    $(X \wedge_{min} Y \neq \bot,\; X \wedge_{min} Y = X,\; X \wedge_{min} Y = Y)$,
    $(X \wedge_{max} Y \neq \bot,\; X \wedge_{max} Y = X,\; X \wedge_{max} Y = Y)$.

In the context of approximate regions, the bottom element, $\bot$, is the function from $G$ to
$\Omega_3$ which takes the value no for every element of $G$. Each of the above triples provides
an RCC5 relation, so the relation between $X$ and $Y$ can be measured by a pair of RCC5
relations. These relations will be denoted by $R_{min}(X, Y)$ and $R_{max}(X, Y)$.

Theorem 1 The pairs $(R_{min}(X, Y), R_{max}(X, Y))$ which can occur are all pairs $(a, b)$
where $a \le b$ with the exception of (PP, EQ) and (PPi, EQ).

Proof First we show that $R_{min}(X, Y) \le R_{max}(X, Y)$. Suppose that $R_{min}(X, Y) =
(a_1, a_2, a_3)$ and that $R_{max}(X, Y) = (b_1, b_2, b_3)$. We have to show that $a_i \le b_i$ for
$i = 1, 2, 3$. Taking the first component, if $X \wedge_{min} Y \neq \bot$ then for each $g$ such
that $Xg \wedge_{min} Yg \neq no$, we also have, by examining the tables for $\wedge_{min}$ and $\wedge_{max}$,
that $Xg \wedge_{max} Yg \neq no$. Hence $X \wedge_{max} Y \neq \bot$. Taking the second component, if
$X \wedge_{min} Y = X$ then $X \wedge_{max} Y = X$ because from $Xg \wedge_{min} Yg = Xg$ it follows
that $Xg \wedge_{max} Yg = Xg$. This can be seen from the tables for $\wedge_{min}$ and $\wedge_{max}$ by
considering each of the three possible values for $Xg$. The case of the third component
follows from the second since $\wedge_{min}$ and $\wedge_{max}$ are commutative.

Finally we have to show that the pairs (PP, EQ) and (PPi, EQ) cannot occur. If
$R_{max}(X, Y) = EQ$, then $X = Y$ so $X \wedge_{min} Y = X$ must take the same value as
$X \wedge_{min} Y = Y$. Thus the only triples which are possible for $R_{min}(X, Y)$ are those
where the second and third components are equal. This rules out the possibility that
$R_{min}(X, Y)$ is PP or PPi. $\square$
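
The statement of Theorem 1 can be checked by brute force on a small partition. The following Python sketch enumerates all pairs of approximate regions over three cells and collects the pairs of relations that arise. It assumes that the all-no function (the bottom element) is excluded as an approximate region and that triples map to RCC5 names as (F,F,F)=DR, (T,F,F)=PO, (T,T,F)=PP, (T,F,T)=PPi, (T,T,T)=EQ; the function names are illustrative.

```python
from itertools import product

VALUES = ("no", "po", "fo")
MEET_MIN = {
    "no": {"no": "no", "po": "no", "fo": "no"},
    "po": {"no": "no", "po": "no", "fo": "po"},
    "fo": {"no": "no", "po": "po", "fo": "fo"},
}
MEET_MAX = {
    "no": {"no": "no", "po": "no", "fo": "no"},
    "po": {"no": "no", "po": "po", "fo": "po"},
    "fo": {"no": "no", "po": "po", "fo": "fo"},
}
RCC5 = {
    (False, False, False): "DR",
    (True, False, False): "PO",
    (True, True, False): "PP",
    (True, False, True): "PPi",
    (True, True, True): "EQ",
}


def relation(table, X, Y):
    """RCC5 relation obtained from the triple built with the given meet table."""
    meet = tuple(table[a][b] for a, b in zip(X, Y))
    bottom = ("no",) * len(X)
    return RCC5[(meet != bottom, meet == X, meet == Y)]


def possible_pairs(n_cells):
    """All pairs (R_min, R_max) over non-bottom approximate regions on n_cells cells."""
    bottom = ("no",) * n_cells
    regions = [X for X in product(VALUES, repeat=n_cells) if X != bottom]
    return {
        (relation(MEET_MIN, X, Y), relation(MEET_MAX, X, Y))
        for X, Y in product(regions, repeat=2)
    }


pairs = possible_pairs(3)
print(sorted(pairs))
```

With three cells, every pair allowed by the theorem appears and the two excluded pairs do not; with only two cells some allowed pairs, such as (DR, PO), cannot be realized.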

¹ This technique has many variants since there are many different ways in which the RCC5 can
be formally defined in the precise case, and some of these can be generalized in different ways
to the approximate case. The fact that several different generalizations can arise from the same
formula is because some of the operations in $R$ (such as $\wedge$ and $\vee$) have themselves more than
one generalization to operations on $\Omega_3^G$.
4.3 Correspondence of Semantic and Syntactic Generalization

Let the syntactic generalization of RCC5 be defined by

    $SYN(X, Y) = (R_{min}(X, Y), R_{max}(X, Y))$

where $R_{min}$ and $R_{max}$ are as defined above.

Theorem 2 For any approximate regions $X$ and $Y$, the two ways of measuring the
relationship of $X$ to $Y$ are equivalent in the sense that

    $SEM(X, Y) = \{\rho \in RCC5 \mid R_{min}(X, Y) \le \rho \le R_{max}(X, Y)\}$,

where RCC5 is the set {EQ, PP, PPi, PO, DR}, and $\le$ is the ordering in the RCC5
lattice.

The proof of this theorem depends on assumptions about the set of precise regions
$R$. We assume that $R$ is a model of the RCC axioms so that we are approximating
continuous space, and not approximating a space of already approximated regions.

Proof There are three things to demonstrate. Firstly that for all $x \in [\![X]\!]$ and
$y \in [\![Y]\!]$, $R_{min}(X, Y) \le RCC5(x, y)$. Secondly, for all $x$ and $y$ as before,
that $RCC5(x, y) \le R_{max}(X, Y)$, and thirdly that if $\rho$ is any RCC5 relation such that
$R_{min}(X, Y) \le \rho \le R_{max}(X, Y)$ then there exist particular $x$ and $y$ which stand in the
relation $\rho$ to each other.

To prove the first of these it is necessary to consider each of the three components
$X \wedge_{min} Y \neq \bot$, $X \wedge_{min} Y = X$ and $X \wedge_{min} Y = Y$ in turn. If
$X \wedge_{min} Y \neq \bot$ is true, we have to show for all $x$ and $y$ that $x \wedge y \neq \bot$ is also
true. From $X \wedge_{min} Y \neq \bot$ it follows that there is at least one cell $g$ where one of $X$
and $Y$ fully overlaps $g$, and the other at least partially overlaps $g$. Hence all
interpretations of $X$ and $Y$ have non-empty intersection. If $X \wedge_{min} Y = X$ is true then
for all cells $g$ we have $Xg = no$ or $Yg = fo$. In each case every interpretation must
satisfy $x \wedge y = x$. Note that this depends on the fact that the combination
$Xg = po = Yg$ cannot occur. The case of the final component $X \wedge_{min} Y = Y$ is similar.
Thus we have demonstrated for all $x \in [\![X]\!]$ and $y \in [\![Y]\!]$ that
$R_{min}(X, Y) \le RCC5(x, y)$.

The task of showing that $RCC5(x, y) \le R_{max}(X, Y)$ is accomplished by a similar analysis.

Finally, we have to show that for each RCC5 relation $\rho$, where
$R_{min}(X, Y) \le \rho \le R_{max}(X, Y)$, there are $x \in [\![X]\!]$ and $y \in [\![Y]\!]$ such that the
relation of $x$ to $y$ is $\rho$. This is done by considering the various possibilities for
$R_{min}(X, Y)$ and $R_{max}(X, Y)$. We will only consider one of the cases here, but the others
are similar. If $R_{min}(X, Y) = PO$ and $R_{max}(X, Y) = EQ$, then for each cell $g$, the values
of $Xg$ and $Yg$ are equal and there must be some cells where this value is po and some
cells where the value is fo. Precise regions $x \in [\![X]\!]$ and $y \in [\![Y]\!]$ can be constructed
by selecting sub-regions of each cell $g$, say $x_g$ and $y_g$, and defining $x$ and $y$ to be the
unions of these sets of sub-regions. In this particular case, there is sufficient freedom
with those cells where $Xg = Yg = po$ to be able to select $x_g$ and $y_g$ so that the relation
of $x$ to $y$ can be any $\rho$ where $PO \le \rho \le EQ$.
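
Theorem 2 says that the pair $(R_{min}, R_{max})$ determines the semantic generalization as an interval in the RCC5 lattice. A minimal Python sketch of that interval computation follows. It assumes the triple encoding (0,0,0)=DR, (1,0,0)=PO, (1,1,0)=PP, (1,0,1)=PPi, (1,1,1)=EQ, and the name `sem_from_syn` is an illustrative choice.

```python
# Each RCC5 relation as a triple of 0/1 values, ordered component-wise.
TRIPLE = {
    "DR": (0, 0, 0),
    "PO": (1, 0, 0),
    "PP": (1, 1, 0),
    "PPi": (1, 0, 1),
    "EQ": (1, 1, 1),
}


def leq(r: str, s: str) -> bool:
    """r <= s in the RCC5 lattice."""
    return all(a <= b for a, b in zip(TRIPLE[r], TRIPLE[s]))


def sem_from_syn(r_min: str, r_max: str) -> set:
    """All relations rho with r_min <= rho <= r_max."""
    return {rho for rho in TRIPLE if leq(r_min, rho) and leq(rho, r_max)}


print(sorted(sem_from_syn("PO", "EQ")))
print(sorted(sem_from_syn("DR", "PP")))
```

For the case treated in the proof, $R_{min} = PO$ and $R_{max} = EQ$, the interval contains PO, PP, PPi and EQ, so any of these four relations may hold between the precise regions.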
5 Generalizing RCC8 Relations

5.1 RCC8 Relations 

RCC8 relations take the topological distinction between interior and boundary into
account. In order to describe RCC8 relations we define the relationship between $x$ and $y$
by using a triple, but where the three entries may take one of three truth values rather
than the two Boolean ones. The scheme has the form

    $(x \wedge y \not\approx \bot,\; x \wedge y \approx x,\; x \wedge y \approx y)$

where $x \wedge y \not\approx \bot$ takes the value

    T if the interiors of $x$ and $y$ overlap,
      i.e., $x \wedge y \neq \bot$;
    M if only the boundaries of $x$ and $y$ overlap,
      i.e., $x \wedge y = \bot$ and $\delta x \wedge \delta y \neq \bot$;
    F if there is no overlap between $x$ and $y$,
      i.e., $x \wedge y = \bot$ and $\delta x \wedge \delta y = \bot$;

and where $x \wedge y \approx x$ takes the value

    T if $x$ is contained in $y$ and the boundaries are either disjoint or identical,
      i.e., $x \wedge y = x$ and ($\delta x \wedge \delta y = \bot$ or $\delta x \wedge \delta y = \delta x$);
    M if $x$ is contained in $y$ and the boundaries are not disjoint and not identical,
      i.e., $x \wedge y = x$ and $\delta x \wedge \delta y \neq \bot$ and $\delta x \wedge \delta y \neq \delta x$;
    F if $x$ is not contained within $y$, i.e., $x \wedge y \neq x$;

and similarly for $x \wedge y \approx y$.

The correspondence between such triples, the RCC8 classification, and possible
geometric interpretations are given below.

[Figure: geometric interpretations of DC(x,y), EC(x,y), PO(x,y), TPP(x,y), NTPP(x,y)
and EQ(x,y), arranged as the RCC8 lattice.]

The RCC5 relation DR refines to DC and EC and the RCC5 relation PP(i) refines to
TPP(i) and NTPP(i). We define F < M < T and call the corresponding Hasse diagram the
RCC8 lattice (figure above) to distinguish it from the conceptual neighborhood graph.



5.2 Syntactic Generalization of RCC8 

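Let $X$ and $Y$ be boundary sensitive approximations of regions $x$ and $y$. The generalized
scheme has the form

    $((X \wedge_{min} Y \not\approx \bot,\; X \wedge_{min} Y \approx X,\; X \wedge_{min} Y \approx Y),$
    $\;(X \wedge_{max} Y \not\approx \bot,\; X \wedge_{max} Y \approx X,\; X \wedge_{max} Y \approx Y))$

where $X \wedge_{min} Y \not\approx \bot$ takes the value

    T if $X \wedge_{min} Y \neq \bot$;
    M if $X \wedge_{min} Y = \bot$ and $\delta X \wedge_{min} \delta Y \neq \bot$;
    F if $X \wedge_{min} Y = \bot$ and $\delta X \wedge_{min} \delta Y = \bot$;

and where $X \wedge_{min} Y \approx X$ takes the value

    T if $X \wedge_{min} Y = X$ and ($\delta X \wedge_{min} \delta Y = \bot$ or $\delta X \wedge_{min} \delta Y = \delta X$);
    M if $X \wedge_{min} Y = X$ and $\delta X \wedge_{min} \delta Y \neq \bot$ and $\delta X \wedge_{min} \delta Y \neq \delta X$;
    F if $X \wedge_{min} Y \neq X$;

and similarly for $X \wedge_{min} Y \approx Y$, $X \wedge_{max} Y \not\approx \bot$, $X \wedge_{max} Y \approx X$, and
$X \wedge_{max} Y \approx Y$.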

In this context the bottom element, $\bot$, is the function from $G \times G$ to $\Omega_5$ which takes the
value no for every element of $G \times G$. The formula $\delta X \wedge_{min} \delta Y \neq \bot$ is true if we can
derive from boundary sensitive approximations $X$ and $Y$ that for all $x \in [\![X]\!]$ and
$y \in [\![Y]\!]$ the least relation that can hold between $x$ and $y$ involves boundary intersection².
Correspondingly, $\delta X \wedge_{max} \delta Y \neq \bot$ is true if the largest relation that can hold between
$x \in [\![X]\!]$ and $y \in [\![Y]\!]$ involves boundary intersection.

Each of the above triples defines an RCC8 relation, so the relation between $X$ and
$Y$ can be measured by a pair of RCC8 relations. These relations will be denoted by
$R^8_{min}(X, Y)$ and $R^8_{max}(X, Y)$. [BS00] show the correspondence between this syntactic
generalization and the semantic generalization corresponding to the RCC5 case.

6 Conclusions 

In this paper we discussed approximations of spatial regions with respect to an underly- 
ing regional partition. We used approximations based on approximation functions and 
discussed the close relationship to rough sets. We defined pairs of greatest minimal and 
least maximal meet operations on approximation functions that constrain the possible 
outcome of the meet operation between the approximated regions themselves. The meet 
operations on approximation mappings provide the basis for approximate qualitative 
spatial reasoning that was proposed in this paper. 

Approximate qualitative spatial reasoning is based on: (1) Jointly exhaustive and 
pair-wise disjoint sets of qualitative relations between spatial regions, which are de- 
fined in terms of the meet operation of the underlying Boolean algebra structure of 
the domain of regions. As a set these relations must form a lattice with bottom and 
top element. (2) Approximations of regions with respect to a regional partition of the 
underlying space. Semantically, an approximation corresponds to the set of regions it 
approximates. (3) Pairs of meet operations on those approximations, which approximate 
the meet operation on exact regions. 

Based on those ‘ingredients’, syntactic and semantic generalizations of jointly
exhaustive and pair-wise disjoint relations between regions were defined. Generalized
relations hold between approximations of regions rather than between (exact) regions
themselves. Syntactic generalization is based on replacing the meet operation defining
relations between regions by its greatest minimal and least maximal counterparts on
approximations. Semantically, syntactic generalizations yield upper and lower bounds
(within the underlying lattice structure) on relations that can hold between the
corresponding approximated regions.

² For details see [BS00].

There is considerable scope for further work building on the results in this paper. 
We have assumed so far that the regions being approximated are precisely known re- 
gions in a continuous space. However, there are practical examples where approximate 
regions are themselves approximated. This can occur when spatial data is required at 
several levels of detail, and the less detailed representations are approximations of the 
more detailed ones. One direction for future investigation is to extend the techniques 
in this paper to the case where the regions being approximated are discrete, rather than 
continuous. This could make use of the algebraic approach to qualitative discrete space
presented in [Ste00]. Another direction of ongoing research is to apply the techniques
presented in this paper to the temporal domain [Bit00].
Abstract. This paper discusses how the set of conditional independences
(CIs) changes when learning a Probabilistic Network based on the Markov
property. The possible changes are generalized into several cases. We
show that these cases are sound and complete. Any structure learning
method for the Decomposable Markov Network and Bayesian Network
will fall into these cases. This study indicates which kinds of domain model
can be learned and which cannot. It suggests that prior knowledge about
the problem domain decides the basic frame for future learning.

1 Introduction 

A probabilistic network (PN) combines a qualitative graphical structure, which
encodes dependencies among domain variables, with a quantitative probability
distribution, which specifies the strength of these dependencies [1], [3]. As many
effective probabilistic inference techniques have been developed and the
probabilistic network has been applied in many artificial intelligence domains, many
researchers have turned their attention to automatically obtaining a model $M$ of a PN
from data [2], [4], [8].

Counting from a complete graph with $n$ variables, there are various models
with $1, 2, \dots, m$, $m < n(n - 1)/2$ links missing. We call all of these models the search
space for structure learning [9]. Structure learning is to find an efficient way
to pick out the model in the search space which incorporates the most CIs of $M$.
Starting from a PN with a certain graphical structure which characterizes an
original CIs set, there exist some "false" CIs which are superfluous or missing
relative to the actual ones of $M$; links are picked to be deleted or added according to
different approaches (constraint-based or score-based). When a certain criterion
is satisfied, the "pick-up" action stops and the final CIs set is obtained,
which is intuitively regarded as the representation of $M$ [4], [5], [6], [7]. A more
extensive review of the literature on learning PNs can be found in [8].

However, not every CIs set can be incorporated into a DMN or BN; [1] indicates
that a class of domain models cannot be learned by a certain procedure.
To find which kinds of CIs can be learned and which cannot, normally we have
to use d-separation to compare the new CIs set with the original one [3].
In this paper, we discuss all possible branchings coming from an original CIs
set corresponding to a DMN or BN respectively, and generalize them into several cases.

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 454-461, 2001.
It is proved that these cases are sound and complete. Any structure learning
method for DMN or BN will fall into this category.

We start from any probabilistic structure, which can be regarded as the prior
knowledge about the problem. Links will be added or deleted in order to match
the data. As soon as the original structure is fixed, we argue that all of the
possible changes are also fixed correspondingly, which means that the prior
knowledge already draws the frame for future refinement of the domain
knowledge.

2 The Cases for Decomposable Markov Network 

In this section, we describe a set of rules on how the CI set changes after one link
at a time is deleted from the undirected graphical structure of a DMN. It is
also proved that these rules are sound and complete. A Decomposable Markov
Network is a pair $(G, P)$ where $G$ is a chordal graph over $n$ variables and $P$ is
an associated JPD which can be factorized as

    $P(v_1, \dots, v_n) = \prod_{i=1}^{m} p(C_i) \,/\, \prod_{i=1}^{m} p(S_i)$    (1)

where $p(C_i)$ are probability functions defined on the maximal cliques $C_1, \dots, C_m$
of $G$, and $p(S_i)$ are probability functions defined on the separator of $C_i$, given by
$S_i = C_i \cap (C_1 \cup \dots \cup C_{i-1})$ [2].

After a link $(x, y)$ is deleted from a DMN, corresponding changes take place in
the CIs set as well. All kinds of possible changes are
generalized as follows. Several cases can apply concurrently if their conditions
are satisfied at the same time. We use $\Rightarrow$ to denote the CIs set change after
deleting one link from the original graph.

Proposition 1. Given two cliques $C_1$, $C_2$ in the graph structure $G$ of a DMN and a
link $(x, y) \in C_1 \cup C_2$. Suppose this link is deleted from $G$; the following rules
are applied to justify the change of CI:

R1: Suppose nodes $x, y \in C_1$; this clique will break into two smaller ones:

    $\emptyset \Rightarrow I(C_1 - \{x\},\; C_1 - \{x, y\},\; C_1 - \{y\})$

R2: Suppose node $y \in C_1$ but $y \notin C_1 \cap C_2$, node $x \in C_1 \cap C_2$, $y \notin C_2$:

    $I(C_1,\; C_1 \cap C_2,\; C_2) \Rightarrow I(C_1 - \{x\},\; C_1 \cap C_2 - \{x\},\; C_2)$
    $\wedge\; I(C_1 - \{y\},\; C_1 \cap C_2,\; C_2) \;\wedge\; I(C_1 - \{x\},\; C_1 - \{x, y\},\; C_1 - \{y\})$

R3: Suppose nodes $x, y \notin C_1 \cap C_2$, but $x, y \in C_1$:

    $I(C_1,\; C_1 \cap C_2,\; C_2) \Rightarrow I(C_1 - \{x\},\; C_1 \cap C_2,\; C_2) \;\wedge\; I(C_1 - \{y\},\; C_1 \cap C_2,\; C_2)$

R4: Suppose nodes $x, y \in C_1 \cap C_2$, $C_1 \neq C_2$; the DMN could become non-chordal.

R5: Suppose $C_1 \cap C_2 = \emptyset$, node $x \in C_1$, node $y \in C_2$:

    $I(C_1, \{x\}, \{x, y\}) \;\wedge\; I(C_2, \{y\}, \{x, y\}) \Rightarrow I(C_1, \emptyset, C_2)$
Proof: After a link $(x, y)$ is deleted from a complete graph, node $x$ is still connected
to all other nodes except $y$, and node $y$ is still connected to all other nodes except $x$.
The other nodes remain connected to each other. Therefore, there are two
cliques after deletion: one includes $x$ and all other nodes except $y$;
the other includes $y$ and all other nodes except $x$. Their separator is
the set of all variables except $x$ and $y$; thus we get $I(C - \{x\}, C - \{x, y\}, C - \{y\})$
(R1 applies).
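
The argument for R1 can be checked on a small example. The Python sketch below lists the maximal cliques of a graph with a basic Bron-Kerbosch enumeration (a standard algorithm used here for illustration, not something the paper prescribes) and applies it to a complete graph on four nodes after deleting the link $(x, y)$. The node names and helper names are hypothetical.

```python
from itertools import combinations


def maximal_cliques(nodes, edges):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = set()

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.add(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(nodes), set())
    return cliques


# Complete graph on {a, b, x, y} with the link (x, y) deleted.
nodes = ["a", "b", "x", "y"]
edges = [e for e in combinations(nodes, 2) if set(e) != {"x", "y"}]
cliques = maximal_cliques(nodes, edges)
separator = frozenset.intersection(*cliques)
print(cliques, separator)
```

The two resulting cliques are $C - \{y\}$ and $C - \{x\}$, and their intersection is $C - \{x, y\}$, which is the separator appearing in R1.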

Suppose $y \in C_1$ but $y \notin C_1 \cap C_2$. According to the proof of R1, after $(x, y)$ is
deleted, $C_1$ will break into two cliques $C_1 - \{x\}$ and $C_1 - \{y\}$. Since $x \in C_1 \cap C_2$,
we get $(C_1 - \{x\}) \cap C_2 = C_1 \cap C_2 - \{x\}$ and $(C_1 - \{y\}) \cap C_2 = C_1 \cap C_2$. Therefore,
$I(C_1 - \{x\}, C_1 \cap C_2 - \{x\}, C_2)$ and $I(C_1 - \{y\}, C_1 \cap C_2, C_2)$ hold, respectively
(R2 applies).

According to the proof of R1, after $(x, y)$ is deleted, $C_1$ will break into two
cliques $C_1 - \{x\}$ and $C_1 - \{y\}$. Since $x, y \notin C_1 \cap C_2$, we get $(C_1 - \{x\}) \cap C_2 =
C_1 \cap C_2$ and $(C_1 - \{y\}) \cap C_2 = C_1 \cap C_2$. Therefore, $I(C_1 - \{x\}, C_1 - \{x, y\}, C_1 - \{y\})$,
$I(C_1 - \{x\}, C_1 \cap C_2, C_2)$ and $I(C_1 - \{y\}, C_1 \cap C_2, C_2)$ hold, respectively (R3 applies).

Suppose $C_1 \neq C_2$; there must be at least one variable in each clique that is not
in the other, thus $u \in C_1$, $v \in C_2$, while $u \notin C_2$, $v \notin C_1$, which means they are
not connected with each other, thus $(u, v)$ does not exist. If $x, y \in C_1 \cap C_2$, the
two variables $u$ and $v$ must each be connected to both $x$ and $y$. Therefore, $u - x - v - y - u$
forms a cycle of length 4. The only two possible chords are $(x, y)$ and $(u, v)$;
since $(u, v)$ does not exist, after we delete $(x, y)$ this graph would
be non-chordal (R4 applies).

According to the given condition, $(x, y)$ is the only link connecting $C_1$ with $C_2$. After
deletion, cliques $C_1$ and $C_2$ will not be connected any more (R5 applies).

Proposition 2. Rules R1-R5 are complete for all possible CIs changes when deleting
one link at a time in the graphical structure of a DMN.

Proof: By induction, there is no other possibility to choose a link except those above.
Therefore, no matter which link we choose to delete in the graphical structure
of $M$, some of the rules R1-R5 always apply to the corresponding
CIs set change.

Example 1. Given a complete graph with 4 variables, Figure 1 indicates how to
apply the rules above to find the corresponding change of the CIs set.

3 The Cases for Bayesian Network 

Because we have to consider the directionality in a directed graphical structure,
it is much more complicated to study the CIs set change in a Bayesian
Network (BN) than in a DMN. However, Pearl [3] showed that it is the directionality
that makes the BN a richer language for expressing dependencies. Some
CIs can be expressed by a BN but not by a DMN. As BN and DMN are so
closely related, study of either one will benefit the study of the other as well.




A Study of Conditional Independence Change in Learning Probabilistic Network 457 




Fig. 1. Applying rules for CIs set change to a graph with 4 variables with different
topological structures.



A Bayesian network is a pair $(D, P)$ where $D$ is a directed acyclic graph
(DAG) over $n$ variables and $P$ is an associated JPD which can be factorized as

    $P(v_1, \dots, v_n) = \prod_{i=1}^{n} p(v_i \mid pa(v_i))$    (2)

where $pa(v_i)$ is the parent set of node $v_i$ in $D$ [2].
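
A minimal Python sketch of factorization (2) for binary variables is given below. The three-node network and all probability values are made up purely for illustration; each table stores the probability that a node is true given the values of its parents.

```python
from itertools import product

# Hypothetical DAG: a -> b, a -> c, b -> c.
parents = {"a": (), "b": ("a",), "c": ("a", "b")}

# Hypothetical conditional probability tables: P(v = True | parent values).
p_true = {
    "a": {(): 0.3},
    "b": {(True,): 0.8, (False,): 0.1},
    "c": {(True, True): 0.9, (True, False): 0.5,
          (False, True): 0.4, (False, False): 0.05},
}


def joint(assignment):
    """P(v_1, ..., v_n) as the product of p(v_i | pa(v_i)) over all nodes."""
    prob = 1.0
    for v, pa in parents.items():
        p = p_true[v][tuple(assignment[u] for u in pa)]
        prob *= p if assignment[v] else 1.0 - p
    return prob


total = sum(
    joint(dict(zip(parents, values)))
    for values in product([True, False], repeat=len(parents))
)
print(joint({"a": True, "b": True, "c": True}), total)
```

Summing the product over all assignments gives 1 up to floating point rounding, which is a quick sanity check that the factorization defines a joint distribution.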

3.1 Local directed Markov property 

For the DAG $D$ in a BN, it is proved that if $P(v_1, \dots, v_n)$ admits factorization
according to $D$, $P$ must obey the local directed Markov property (DL) [2]; that is,
any variable $v$ is conditionally independent of its non-descendants, $nd(v)$,
given its parents $pa(v)$ in the DAG, i.e. $I(v, pa(v), nd(v))$.
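
The DL statements of a small DAG can be listed with the following Python sketch. It assumes the DAG is given as a mapping from each node to its parents, takes $nd(v)$ to exclude $v$ itself, and removes the parents from the third argument (an assumption made here so that the conditioning set is not repeated); the example graph and function names are hypothetical.

```python
def descendants(parents, v):
    """All nodes reachable from v by following arrows, excluding v itself."""
    children = {u: set() for u in parents}
    for child, pas in parents.items():
        for p in pas:
            children[p].add(child)
    found, stack = set(), [v]
    while stack:
        for w in children[stack.pop()]:
            if w not in found:
                found.add(w)
                stack.append(w)
    return found


def dl_statement(parents, v):
    """Return (v, pa(v), nd(v) - pa(v)), read as I(v, pa(v), nd(v) - pa(v))."""
    pa = set(parents[v])
    nd = set(parents) - descendants(parents, v) - {v}
    return v, pa, nd - pa


# Hypothetical DAG: a -> b, a -> c, b -> d, c -> d.
dag = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
for node in dag:
    print(dl_statement(dag, node))
```

For node b this yields I(b, {a}, {c}): given its parent a, b is independent of its remaining non-descendant c.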

This property shows how CIs set changes correspond to
arrow variation in the BN. This section discusses how the DL property
changes for two nodes, $v_i$ and $v_j$, their corresponding ancestors, $A_i$ and
$A_j$, and their corresponding descendants, $D_i$ and $D_j$, after the arrow direction between
these two nodes is changed or the arrow is deleted. We write $v_i \to v_j$ to denote
an arrow from $v_i$ to $v_j$. If there is a path from $v_i$ to $v_j$, it is denoted as $v_i \mapsto v_j$.
Where it is clear from the context, we also use $A_i \mapsto v_i$ to denote that there
is a path from a node in $A_i$ to $v_i$ and $A_i \mapsto A_j$ to denote that there is a path from a
node in $A_i$ to a node in $A_j$.

Lemma 1. For those nodes without a path to or from $v_i$ and $v_j$ in a DAG, the
arrow variation of $v_i \to v_j$ (deleting the arrow or changing its direction) will not change
their DL properties.

Proof: Assume $v_k$ is one of these nodes. For a node $v_a$ in $A_i \cup A_j$, there is no
path from $v_k$ to it because $v_a \mapsto v_i$. It follows that $A_i \cup A_j \subseteq nd(v_k)$.
There is no path from a node $v_d \in D_i \cup D_j$ to $v_k$, otherwise $v_i \mapsto v_k$ or
$v_j \mapsto v_k$. If $v_k \not\mapsto v_d$, then $nd(v_d) \subseteq nd(v_k)$. After arrow variation, if there is
a path from $v' \in de(v_d)$ to a node $v'' \in nd(v_d)$, it must reach node $v''$ via $v_i$
or $v_j$. This is impossible because the resulting path would form a cycle, which
conflicts with the definition of a BN.

Lemma 2. After the direction of an arrow $v_i \to v_j$ is changed in a BN, a path 
forms a cycle if and only if $v_i \mapsto A_j$. We call such a path a weak path. 

Proof: It is easy to verify that $v_i \mapsto A_j$ is a weak path. The remaining 
part of the proof is to show that only this kind of weak path exists. 

If $v_i \mapsto A_j$, there is a node $v$ with $v_i \mapsto v \mapsto A_j$, which 
means $v \in D_i$ and $D_i \mapsto A_j$. Furthermore, since $A_j \mapsto v_j$ is 
trivial, $D_i \mapsto v_j$ holds. 

If $D_i \mapsto v_j$, there must be a node $v \in D_i$ with $v \mapsto v_j$. 
Therefore $v \in A_j$ and $D_i \mapsto A_j$ holds. Since $v_i \mapsto D_i$ is 
trivial, if $D_i \mapsto A_j$ then $v_i \mapsto A_j$ is trivial. 

If $D_i \mapsto A_j$, then since $v_i \mapsto D_i$ is trivial, $v_i \mapsto A_j$ 
exists. Since $A_j \mapsto v_j$ is trivial, $D_i \mapsto v_j$ holds. 

There is no other possibility to form a weak path related to $v_i \to v_j$. 
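Lemma 2 can be checked mechanically: reversing $v_i \to v_j$ closes a cycle exactly when $v_i$ still reaches $v_j$ by some path that does not use the arrow itself, which in a DAG is the same as $v_i$ reaching an ancestor of $v_j$. The sketch below is one way to test this; the function names and the two example graphs are hypothetical.

```python
def has_path(arrows, source, target, skip=None):
    """True if a directed path source -> ... -> target exists, ignoring arrow `skip`."""
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        for parent, child in arrows:
            if parent != node or (parent, child) == skip:
                continue
            if child == target:
                return True
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return False


def reversal_creates_cycle(arrows, vi, vj):
    """Weak-path test: does vi reach vj without using the arrow vi -> vj?"""
    return has_path(arrows, vi, vj, skip=(vi, vj))


# Hypothetical DAGs for illustration
with_weak_path = [("vi", "vj"), ("vi", "m"), ("m", "vj")]   # vi -> m -> vj
without_weak_path = [("vi", "vj"), ("a", "vi")]

print(reversal_creates_cycle(with_weak_path, "vi", "vj"))
print(reversal_creates_cycle(without_weak_path, "vi", "vj"))
```

In the first graph the node `m` is both a descendant of `vi` and an ancestor of `vj`, so reversing the arrow would give the cycle `vj -> vi -> m -> vj`.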

Lemma 3. For $v_i$, $v_j$, their ancestor sets $A_i$, $A_j$ and their descendant 
sets $D_i$, $D_j$, the only possible kinds of path are: $A_i \mapsto A_j$, 
$A_j \mapsto A_i$, $A_i \mapsto D_j$, $A_j \mapsto D_i$, $D_i \mapsto D_j$, 
$D_j \mapsto D_i$. 

The possible paths are shown in Figure 2. Since the discussion of possible 
paths for $v_j$ is exactly the same as that for $v_i$ after the arrow direction 
is changed, the propositions stated for $v_i$ apply symmetrically to $v_j$. 




Fig. 2. Possible paths from $v_i$, $A_i$ and $D_i$ to other components, shown as 
dashed arrows. 



3.2 Arrow direction change 

After the arrow direction is changed from $v_i \to v_j$ to $v_i \leftarrow v_j$, 
the non-descendant sets of $v_i$, $v_{a_i} \in A_i$ and $v_{d_i} \in D_i$ could 
change correspondingly. In this subsection, we discuss how these changes in the 
non-descendant sets affect the related CIs. 

Proposition 3. After the arrow direction is changed from $v_i \to v_j$ to 
$v_i \leftarrow v_j$, if there is a node $v_{d_i} \in D_i$ with 
$v_{d_i} \not\mapsto D_j$, then 

$I(v_i, pa(v_i), (A_i \cup A_j) - pa(v_i)) \Rightarrow I(v_i, pa(v_i) \cup v_j, (A_i \cup A_j \cup D'_j) - pa(v_i))$; (3) 

where $D'_j = D_j - (D_j \cap nd(v_{d_i}))$; otherwise 

$I(v_i, pa(v_i), (A_i \cup A_j) - pa(v_i)) \Rightarrow I(v_i, pa(v_i) \cup v_j, (A_i \cup A_j \cup D_j) - pa(v_i))$. (4) 

Proposition 4. After the arrow direction is changed from $v_i \to v_j$ to 
$v_i \leftarrow v_j$, for a node $v_{a_i} \in A_i$, if there is a path 
$A_i \mapsto A_j$, then its corresponding CI will not change; or else, if there 
is a node $v_{d_i} \in D_i$ with $v_{d_i} \not\mapsto D_j$, then 

$I(v_{a_i}, pa(v_{a_i}), (A'_i \cup A_j) - pa(v_{a_i})) \Rightarrow I(v_{a_i}, pa(v_{a_i}), (A'_i \cup A_j \cup v_j \cup D'_j) - pa(v_{a_i}))$; (5) 

where $A'_i = A_i - de(v_{a_i})$ and $D'_j = D_j - (D_j \cap nd(v_{d_i}))$; otherwise 

$I(v_i, pa(v_i), (A'_i \cup A_j) - pa(v_{a_i})) \Rightarrow I(v_i, pa(v_i) \cup v_j, (A'_i \cup A_j \cup v_j \cup D_j) - pa(v_{a_i}))$. (6) 

Proposition 5. After the arrow direction is changed from $v_i \to v_j$ to 
$v_i \leftarrow v_j$, for a node $v_{d_i} \in D_i$ there is no change in its 
corresponding CI. 

Propositions 3 and 4 provide rules for the corresponding nodes in the DAG. 
Note that another node could also characterize its own DL property in terms 
of the same CI. 

3.3 Arrow deleted 

After the arrow $v_i \to v_j$ is deleted from the DAG of a BN, the 
non-descendant sets of $v_i$, $v_{a_i} \in A_i$ and $v_{d_i} \in D_i$ could also 
change correspondingly. We discuss the related CI changes in this subsection. 

Proposition 6. If there is a weak path related to $v_i \to v_j$, then deleting 
the arrow $v_i \to v_j$ will not change any CI. 

Proposition 7. After the arrow $v_i \to v_j$ is deleted, if there is a node 
$v_{d_i} \in D_i$ with $v_{d_i} \not\mapsto D_j$, then 

$I(v_i, pa(v_i), A_i \cup A_j) \Rightarrow I(v_i, pa(v_i), A_i \cup A_j \cup v_j \cup D'_j)$ (7) 

where $D'_j = D_j - de(v_{d_i})$; otherwise 

$I(v_i, pa(v_i), A_i \cup A_j) \Rightarrow I(v_i, pa(v_i), A_i \cup A_j \cup v_j \cup D_j)$. (8) 

Proposition 8. After the arrow $v_i \to v_j$ is deleted from the graphical 
structure of a BN, for a node $v_{a_i} \in A_i$, if there is a path 
$A_i \mapsto A_j$, there is no change in the corresponding CI of node $v_{a_i}$; 
or else, if there is a node $v_{d_i} \in D_i$ with $v_{d_i} \not\mapsto D_j$, 
then 

$I(v_{a_i}, pa(v_{a_i}), A'_i \cup A_j) \Rightarrow I(v_{a_i}, pa(v_{a_i}), A'_i \cup A_j \cup v_j \cup D'_j)$ (9) 

where $D'_j = D_j - de(v_{d_i})$ and $A'_i = A_i - de(v_{a_i})$; otherwise 

$I(v_{a_i}, pa(v_{a_i}), A'_i \cup A_j) \Rightarrow I(v_{a_i}, pa(v_{a_i}), A'_i \cup A_j \cup v_j \cup D_j)$. (10) 

Proposition 9. After the arrow $v_i \to v_j$ is deleted from the graphical 
structure of a BN, there is no change in the corresponding CI for a node 
$v_{d_i} \in D_i$. 

3.4 Soundness and Completeness 

Proposition 10. Propositions 3-6 are sound and complete for the corresponding 
CIs set changes of a given BN after an arrow in its graphical structure is 
deleted or changed. 

Proof: Soundness is easy because it follows directly from the analysis of the 
DL property change on each kind of possible path shown in Lemma 3. Here we 
prove completeness by induction. 

Suppose there are only 2 links in the DAG of a given BN. Figure 3 lists all 
of the possible topological structures they may have: 




Fig. 3. All possible topological structures for two links in a DAG, panels (a), (b) and (c) 



For Figure 3(a), suppose we pick the arrow $x \to y$ and change its direction; 
the corresponding CI changes are $I(x, \emptyset, \emptyset) \Rightarrow I(x, y, \emptyset)$ 
and $I(y, x, z) \Rightarrow I(y, \emptyset, \emptyset)$. 

There is still $I(y, x, z)$ for node $z$ before and after the arrow direction 
change. Therefore, after the arrow direction is changed, the CIs set does not 
change. The same result holds for Figure 3(b) and (c). 

For all of these figures, as soon as one arrow is deleted, $A_i$, $A_j$, $D_i$ 
and $D_j$ will be empty. From equations (7), (8), (9) and (10), no CI exists. 

Assume there are $k$ links in the DAG and that, each time an arrow is picked 
to have its direction changed or to be deleted, the change follows Propositions 3-6. 

Given a DAG with $k + 1$ links, we may restrict our consideration to the 
nodes of a chosen link, their ancestor sets and their descendant sets, 
according to Lemma 1. If the arrow direction is to be changed, we have to 
choose a link without a weak path related to it. Lemma 3 lists all the 
possibilities to choose from. In this way, Propositions 3-5 apply. If the 
arrow is to be deleted and there is a weak path related to this arrow, then 
nothing changes in the CIs set according to Proposition 6. If there is no weak 
path related to this arrow, then Propositions 7 to 9 apply to the corresponding 
CIs change. After this arrow is deleted, there are only $k$ links left in the 
DAG, and assumption (2) applies. 

4 Conclusion 

In this paper, we discussed how the CIs set changes in response to link 
variation in the DMN and arrow variation in the BN. We proved that these 
changes are sound and complete in the DMN and the BN, respectively. We argue 
that our propositions apply to any structure learning method that aims to 
obtain a CIs set in terms of a DMN or BN. They also make explicit, step by 
step, the inner mechanism of structure learning. The question of which kinds 
of CIs sets can be learned and which cannot is also answered. From another 
point of view, this suggests that prior knowledge actually determines the scope 
of the problem domain that can be learned from data in the future. In this way, we pave the way 
to the feasibility study of structure learning algorithms. 

Abstract. This paper defines a λ-level rough equality relation 
in rough logic and uses it to establish rough sets on real-valued information 
systems. We obtain some related properties and the corresponding rough 
paramodulation inference rules, which will be used in approximate 
reasoning and deductive resolution in rough logic. Finally, we prove that 
λ-level rough paramodulant reasoning is sound. 



1 Introduction 

Pawlak introduced rough sets via an equivalence relation. The theory provides a 
basic framework for handling uncertain knowledge, and hence it is used in 
studying partial truth values in logic. It has been an inspiration for 
logicians and computer scientists, who have tried to establish logical systems 
of approximate reasoning that can handle incomplete information and deduce 
partial truth values. Partial truth values are usually studied and interpreted 
by means of an information table, so the information system is taken as the 
starting point of rough logic. Equal objects have the same description with 
respect to a set of attributes, which leads to an approximate definition of 
sets of objects through the indiscernibility relation. We define the equality 
of two objects by the λ-level rough equality of their attribute values under a 
given accuracy, and call it a λ-level (or λ-degree) rough equality relation, 
denoted by $=_{\lambda R}$. We derive several of its properties and the 
corresponding inference rules of λ-level rough paramodulation. Its application 
to approximate reasoning and deductive resolution is illustrated with examples. 

Pawlak defined the rough equality of two sets in his works: the set $X$ is 
roughly equal to the set $Y$ if $R_*(X) = R_*(Y) \wedge R^*(X) = R^*(Y)$, written 
$X \approx_R Y$, where $X, Y \subseteq U$ are any subsets of the universe $U$ of 
objects, $R$ is an equivalence relation, and $R_*$ and $R^*$ denote the lower and 
upper approximations respectively. Banerjee and Chakraborty introduced a new 
binary connective $\approx$, defined by the two modal operators $L$ (necessity) 
and $M$ (possibility) in the S5 system of modal logic. Namely, the formula 
$\alpha$ is roughly equal to the formula $\beta$ iff 
$(L\alpha \leftrightarrow L\beta) \wedge (M\alpha \leftrightarrow M\beta)$. This 
expresses the rough equality of two formulas in modal logic. Once the rough 
equality connective $\approx$ is added to the S5 system of modal logic, several 
properties of the rough equality connective can be derived. With this idea in 
mind, we define a λ-level rough equality connective $=_{\lambda R}$ in rough 
logic, from which we likewise obtain several related properties in rough logic, 
together with the corresponding λ-level rough paramodulation inference rules. 

2 λ-level Rough Equality Relation 

Comparisons of two arbitrary real numbers usually avoid the equality sign =, 
because exact equality rarely occurs. Hence, when needed, we compare the 
absolute value of their difference with an arbitrarily small positive number 
$\varepsilon$. For instance, $x = y$ may be written $abs(x - y) < \varepsilon$, 
where $abs$ abbreviates the word absolute and $\varepsilon$ is a sufficiently 
small positive number. This makes it useful and necessary to introduce a 
λ-level rough equality connective $=_{\lambda R}$. 

Definition 1. Let $K = (U, A, V, f)$ be a knowledge representation system 
where 

(1). $U$ is a universe of discourse objects; 

(2). $A$ is a set of attributes; 

(3). $V = \bigcup_{a \in A} V_a$ is the union of the sets of attribute values for 
each attribute $a \in A$; here the value $a(x)$ of each object $x \in U$ with 
respect to attribute $a \in A$ is a real number. That is, the knowledge 
representation systems we consider are information tables with real-valued 
attribute values; 

(4). $f : A \times U \to V$ is a mapping, interpreted as follows: for each 
attribute $a \in A$ and object $x \in U$, the feature of $x$ with respect to 
attribute $a$ (the extent to which $x$ possesses attribute $a$) is transformed 
into a quantity, namely a real number $a(x) = v_a \in [0, 1]$. $x$ does not have 
the feature of attribute $a$ at all if $v_a = 0$; $x$ has the full feature of 
attribute $a$ if $v_a = 1$; $x$ has the feature of attribute $a$ to degree $v_a$ 
if $v_a > 0$. 

For example, let the attribute $a$ be the temperature symptom of a patient. 
The temperature is transformed into a quantity $a(x) = v_a \in [0, 1]$, a real 
number: $v_a = 1$ if temperature > 40 degrees C, which represents a high 
temperature symptom for the patient; $v_a = 0$ if 36 degrees C < temperature 
< 37 degrees C, which represents a normal temperature for the patient; 
$v_a > 0$ if 40 degrees C > temperature > 37 degrees C, which represents a 
fever symptom of degree $v_a$ for the patient. 

For a given knowledge representation system, since the attribute values are 
real numbers, and in order to reason conveniently with an equality relation on 
sets of real numbers, we define a λ-level rough equality relation 
$=_{\lambda R}$, which describes the degree of equality of the attribute values 
of two objects $x_1$ and $x_2$ with respect to attribute $a$. If the degree of 
equality is greater than or equal to the value λ, the two objects are 
considered indiscernible, and we call this λ-level rough equality. Its formal 
definition is as follows: 

Definition 2. Let $S = (U, A, V, f)$ be an information system. For all 
$x_1, x_2 \in U$ and $a \in A$, 

$x_1 =_{\lambda R} x_2$ iff $|f(a, x_1) - f(a, x_2)| < \varepsilon$, 




464 Q. Liu 



where $\varepsilon$ is a given approximation accuracy; we say that $x_1$ is 
λ-level rough equal to $x_2$. Obviously, $\varepsilon$ and the degree (or grade) 
of equality λ are related as follows: 

$\varepsilon = 1 - \lambda$. 

It follows that they are complementary. 

For example, let $S = (P, A, V, f)$ represent a system of traditional Chinese 
medicine, where $P$ is the set of patients, $A$ is the set of symptoms, and $V$ 
is the set of real values of the attributes. It is represented by the 
following table. 

P\A   a     b     c     d 
p1    0.25  0.03  0.01  0.95 
p2    0.25  0.03  0.01  0.9 

From the table we obtain that $p_1$ and $p_2$ are 0.4-level rough equal with 
respect to attribute $d$. 
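Definition 2 and the table can be checked with a few lines of Python. This is a minimal sketch: the function name `rough_equal` and the dictionary layout are illustrative choices, while the numbers are those of the table above.

```python
def rough_equal(value_1, value_2, level):
    """lambda-level rough equality of two attribute values in [0, 1]:
    |value_1 - value_2| < epsilon, with epsilon = 1 - level (Definition 2)."""
    epsilon = 1.0 - level
    return abs(value_1 - value_2) < epsilon


# Attribute values of patients p1 and p2, copied from the table
table = {
    "p1": {"a": 0.25, "b": 0.03, "c": 0.01, "d": 0.95},
    "p2": {"a": 0.25, "b": 0.03, "c": 0.01, "d": 0.90},
}

# Level 0.4 gives epsilon = 0.6, and |0.95 - 0.90| is well below that
print(rough_equal(table["p1"]["d"], table["p2"]["d"], level=0.4))

# A much stricter level (epsilon = 0.01) separates the two patients on d
print(rough_equal(table["p1"]["d"], table["p2"]["d"], level=0.99))
```

Raising the level shrinks $\varepsilon$, so fewer pairs of objects count as equal.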

Property 1. Let $S = (U, A, V, f)$ be an information system and let 
$u_{\lambda RI}$ be an assignment function for formulas; then it satisfies: 

(1). $u_{\lambda RI}(x =_{\lambda R} x) = 1$, for all $x \in U$; 

(2). $u_{\lambda RI}(x =_{\lambda R} x') =_{\lambda R} u_{\lambda RI}(x' =_{\lambda R} x)$, for all $x, x' \in U$; 

(3). $u_{\lambda RI}(x =_{\lambda R} x'') \ge \max(u_{\lambda RI}(x =_{\lambda R} x'), u_{\lambda RI}(x' =_{\lambda R} x''))$, for all $x, x', x'' \in U$. 

In the usual case, the relation $=_{\lambda R}$ is the λ-level indiscernibility, 
which we also call λ-level equivalence; hence for all $\alpha < \beta$, it also 
satisfies: 

$=_{\beta R} \ \subseteq \ =_{\alpha R}$. 

Thus we may define the lower approximation and upper approximation of 
$X \subseteq U$ with respect to the λ-level rough equality relation $=_{\lambda R}$. 

Definition 3. Let $A = (U, =_{\lambda R})$ be an approximation space, where $U$ 
is a nonempty finite universe of objects and $=_{\lambda R}$ is a binary λ-level 
rough equality relation on $U$. The lower and upper approximations of 
$X \subseteq U$ are defined as follows: 

$(=_{\lambda R})_*(X) = \{x \in U : [x]_{=_{\lambda R}} \subseteq X\}$, 

and 

$(=_{\lambda R})^*(X) = \{x \in U : [x]_{=_{\lambda R}} \cap X \neq \emptyset\}$, 

respectively. 
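The two approximations of Definition 3 can be computed directly once $[x]_{=_{\lambda R}}$ is read as the set of objects that are λ-level rough equal to $x$. The sketch below assumes a single real-valued attribute and hypothetical data; the function names are illustrative only.

```python
def neighbourhood(values, x, level):
    """Objects y with |f(x) - f(y)| < 1 - level, playing the role of [x]."""
    epsilon = 1.0 - level
    return {y for y in values if abs(values[x] - values[y]) < epsilon}


def lower_approximation(values, target, level):
    """Objects whose whole neighbourhood lies inside the target set."""
    return {x for x in values if neighbourhood(values, x, level) <= target}


def upper_approximation(values, target, level):
    """Objects whose neighbourhood meets the target set."""
    return {x for x in values if neighbourhood(values, x, level) & target}


# Hypothetical values of one attribute for four objects
values = {"x1": 0.10, "x2": 0.15, "x3": 0.50, "x4": 0.90}
target = {"x1", "x3"}
level = 0.8  # epsilon = 0.2

print(lower_approximation(values, target, level))
print(upper_approximation(values, target, level))
```

Here `x1` and `x2` are rough equal at level 0.8, so `x1` drops out of the lower approximation of `{x1, x3}` while `x2` enters its upper approximation.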

Nakamura defined rough grade modalities, combining the common features of 
three models: rough logic, fuzzy logic and modal logic. Stepaniuk defined two 
structures of approximate first-order logic, namely am = (IND, R1, ..., Rm, =) 
and Tm = (R1, ..., Rm). The former has an equality relation symbol =, hence it 
may give several results similar to those of classical first-order logic with 
the equality connective =. Lin and Liu defined a first-order rough logic using 
the neighborhood topological interior and closure as two operators. Yao and 
Liu defined α-degree truth in generalized decision logic: if 
$v(\varphi) = \alpha \in [0, 1]$, we say that the formula $\varphi$ is α-degree 
true. We may move the λ-level rough equality $=_{\lambda R}$ into these logical 
systems, which allows us to create an axiomatic system for the λ-level rough 
equality relation $=_{\lambda R}$. We may then also carry out inference in the 
corresponding systems of approximate reasoning with the λ-level rough equality 
$=_{\lambda R}$. Liu defined an operator in rough logic that is placed in front 
of a formula and is called the degree to which the formula is true. Therefore, 
by moving the λ-level rough equality connective $=_{\lambda R}$ into rough 
logic, we obtain some properties and rough paramodulation inference rules for 
the λ-level rough equality $=_{\lambda R}$. 

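3 The Significance of λ-level Rough Equality Relation in Rough Logic 

Let $\Sigma = \{p, q, \ldots, \varphi, \psi, \ldots, \alpha, \beta, \ldots, x, y, \ldots\}$ 
be a set of language symbols; they are the propositional variables, well-formed 
formulas, terms and individual variables respectively. $I$ is an interpretation 
function for the formulas, and $u_{\lambda RI}$ is an assignment function to the 
formulas in the interpretation $I$. 

For $\varphi \in \Sigma$, $u_{\lambda RI}(\varphi) \in [0, 1]$. Let $\alpha, \beta, \gamma$ 
be terms in $\Sigma$, and let $\lambda \in [0, 1]$ be a given degree (level) of 
the rough equality relation $=_{\lambda R}$. 

Property 2 

(1). $u_{\lambda RI}(\alpha =_{\lambda R} \alpha) > \lambda$; 

(2). If $u_{\lambda RI}(\alpha =_{\lambda R} \beta) > \lambda$, then $u_{\lambda RI}(\beta =_{\lambda R} \alpha) > \lambda$; 

(3). If $u_{\lambda RI}(\alpha =_{\lambda R} \beta) > \lambda \wedge u_{\lambda RI}(\beta =_{\lambda R} \gamma) > \lambda$, then $u_{\lambda RI}(\alpha =_{\lambda R} \gamma) > u_{\lambda RI}(\alpha =_{\lambda R} \beta \wedge \beta =_{\lambda R} \gamma) > \lambda$; 

(4). If $u_{\lambda RI}(\alpha =_{\lambda R} \beta) > \lambda$, then $u_{\lambda RI}(p(\cdots \alpha \cdots)) =_{\lambda R} u_{\lambda RI}(p(\cdots \beta \cdots))$. 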
Property 3 Let $A = (U, =_{\lambda R})$ be an approximation space, $\lambda > 0.5$, $\zeta \in [0, 1]$, $\zeta > \lambda$, 
let $x, y, z$ be individual variables and let $p(\cdots)$ be an atom in $\Sigma$; thus we have 
(!)• U'XRii^ix =XR ^)) = 

(2) . UxRl{^{x =XR y) V (1 - 0{y =XR x)) > max{uxRl{^{x =XR y)yUXRl{{^ - 

0(y =XR ^))) > 

(3) . uah/((1 - =XR y) V (1 - ^){y =XR z) V ^{x =xr z)) > max{uxRi{(l - 

0(x =XR y)),uxRi{{l - 0(y =\R z)),uxRi{^{x =XR z))) > 

(4) . If uxRiixi =XR '^XRiipi" ‘Xi-^)) =XR RxRi{p{* ' ‘ xq - then 

uxRiii^-Oi^i =\R ^o)V(l-0(p(* “^2 *“))V^(p(* “3^0 *••))) > max{uxRi{{l- 

Q(Xi =x Xo)),UxRl({l - OpO “Xi- ‘)),UXRl{^p{‘ “X 0 “ •))) > 

(5) . If uxRi{xi =XR a:o) RxRiifi^^'Xi--) =XR f{--xo'-)),then uxri{( 1 - 
Oi^i =\R ^ 0 ) V ^ifi'-Xi--) =XR /(-‘^o ••*))) > rnax{uxRT{{l ~ =XR 

Xo)). ^Afi(C(/(‘ • • OTi • • =XR f{--Xo- •)))) > C- 

To sum up, we obtain the axiom set of rough logic with the λ-level rough 
equality relation $=_{\lambda R}$, 

where $j = 1, \ldots, n$, $f$ is a function term, $\alpha_j$ and $\alpha_0$ are 
general terms, and the axiom or theorem symbol of rough logic is written 
differently from $\vdash$ to distinguish it from the one used in classical 
two-valued logic. 

4 λ-Level Rough Paramodulation and Reasoning 

The equality relation is among the most basic and most frequently used 
relations in mathematics. Its important characteristic here is that it 
supports substitution, in the form of λ-level rough equality substitution in 
rough logic. Once the λ-level rough equality relation $=_{\lambda R}$ defined 
in this paper is introduced into the operator rough logic, we obtain the 
λ-level rough equality of two formulas in that logic. 

Definition 4 Let $\varphi$ and $\psi$ be two formulas in the operator rough 
logic. The λ-level rough equality of $\varphi$ and $\psi$ is defined as follows: 

$\varphi =_{\lambda R} \psi$ iff $u_{\lambda RI}(\varphi) =_{\lambda R} u_{\lambda RI}(\psi)$ iff $|u_{\lambda RI}(\varphi) - u_{\lambda RI}(\psi)| < \varepsilon$. 

As before, the relationship between λ and $\varepsilon$ is 

$\lambda = 1 - \varepsilon$. 

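Definition 5 Let $C_1$ and $C_2$ be clauses with no common variables, 
$\lambda > 0.5$, and let the structures of $C_1$ and $C_2$ be as follows: 

$C_1$: $\lambda_1 L[t] \vee C'_1$, for $\lambda_1 > \lambda \vee \lambda_1 < 1 - \lambda$; 

$C_2$: $\lambda_2 (r =_{\lambda R} s) \vee C'_2$, for $\lambda_2 > \lambda$, 

where $\lambda_1 L[t]$ denotes a rough literal containing the term $t$, and 
$C'_1$ and $C'_2$ are still clauses. If $t$ and $r$ have a most general unifier 
$\sigma$, then we call 

$\lambda^* L^{\sigma}[s^{\sigma}] \cup C_1'^{\sigma} \cup C_2'^{\sigma}$ 

a binary λ-level rough paramodulation of $C_1$ and $C_2$, written 
$P_{\lambda R}(C_1, C_2)$, where $\lambda_1 L[t]$ and 
$\lambda_2 (r =_{\lambda R} s)$ are the λ-level rough paramodulation literals, 
and $\lambda^*$ is defined as follows: 

$\lambda^* = (\lambda_1 + \lambda_2)/2$, for $\lambda_1 > \lambda$, 
or 

$\lambda^* = (1 + \lambda_1 - \lambda_2)/2$, for $\lambda_1 < 1 - \lambda$, 

and $L^{\sigma}[s^{\sigma}]$ denotes the result obtained by replacing one single 
occurrence of $t^{\sigma}$ in $L^{\sigma}$ by $s^{\sigma}$. 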
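The degree $\lambda^*$ attached to the paramodulant depends only on $\lambda_1$, $\lambda_2$ and the level λ. A small sketch of the two cases of Definition 5 follows; the function name `combined_degree` and the sample numbers are assumptions made for illustration.

```python
def combined_degree(lambda_1, lambda_2, level):
    """Degree lambda* of a binary lambda-level rough paramodulant (Definition 5)."""
    if not level > 0.5:
        raise ValueError("Definition 5 assumes level > 0.5")
    if not lambda_2 > level:
        raise ValueError("the equality literal needs lambda_2 > level")
    if lambda_1 > level:
        return (lambda_1 + lambda_2) / 2
    if lambda_1 < 1 - level:
        return (1 + lambda_1 - lambda_2) / 2
    raise ValueError("lambda_1 must be > level or < 1 - level")


# Hypothetical degrees with level 0.6
print(combined_degree(0.8, 0.7, level=0.6))   # first case: lambda_1 > level
print(combined_degree(0.3, 0.7, level=0.6))   # second case: lambda_1 < 1 - level
```

In the first case the result is the mean of the two degrees, so it stays above the level whenever both inputs do.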

Example 1. Consider the following clauses, 

$C_1$: $P(a) \vee Q(b)$, 

$C_2$: $a =_{\lambda R} b \vee R(b)$, 

where $L$ is the literal $P(a)$, which may also be written $L[a]$, and $C'_1$ 
is $Q(b)$; $C_2$ contains the equality literal $a =_{\lambda R} b$. Thus, $L[a]$ 
is paramodulated into $L[b]$, which may also be written as $P(b)$. Hence, the 
λ-level rough paramodulation of $C_1$ and $C_2$ is the new clause 

$C$: $P(b) \vee Q(b) \vee R(b)$ 
Example 2. Consider the following clauses 

$C_1$: $P(g(f(x))) \vee Q(x)$, 

$C_2$: $f(g(b)) =_{\lambda R} a \vee R(g(c))$, 

where $L$ is $P(g(f(x)))$, $C'_1$ is $Q(x)$, $r$ is $f(g(b))$, $s$ is $a$, and 
$C'_2$ is $R(g(c))$. $L$ contains the term $f(x)$, which can be unified with 
$r$. Now let $t$ be $f(x)$; a most general unifier of $t$ and $r$ is 
$\sigma = \{g(b)/x\}$, hence $L^{\sigma}[t^{\sigma}]$ is $P(g(f(g(b))))$ and 
$L^{\sigma}[s^{\sigma}]$ is $P(g(a))$. 

Since $C_1'^{\sigma}$ is $Q(g(b))$ and $C_2'^{\sigma}$ is $R(g(c))$, we obtain 
the binary λ-level rough paramodulation of $C_1$ and $C_2$: 

$C$: $P(g(a)) \vee Q(g(b)) \vee R(g(c))$ 

In this example, the two paramodulation literals are $P(g(f(x)))$ and $f(g(b)) =_{\lambda R} a$. 
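The symbolic part of Example 2 can be replayed on terms encoded as nested tuples. This is a sketch under stated assumptions: the tuple encoding and the function names are hypothetical, the most general unifier is supplied by hand rather than computed, and the degrees $\lambda_1$, $\lambda_2$ and $\lambda^*$ are left out.

```python
def substitute(term, sigma):
    """Apply substitution sigma (variable name -> term) to a term or literal."""
    if isinstance(term, str):
        return sigma.get(term, term)
    functor, *args = term
    return (functor, *(substitute(arg, sigma) for arg in args))


def replace_first(term, old, new):
    """Replace the leftmost occurrence of `old` in `term` by `new`.
    Returns the new term and a flag saying whether a replacement happened."""
    if term == old:
        return new, True
    if isinstance(term, str):
        return term, False
    functor, *args = term
    new_args, done = [], False
    for arg in args:
        if not done:
            arg, done = replace_first(arg, old, new)
        new_args.append(arg)
    return (functor, *new_args), done


def paramodulate(literal, rest_1, r, s, rest_2, sigma):
    """Paramodulant of (literal v rest_1) and ((r = s) v rest_2) under sigma."""
    new_literal, replaced = replace_first(
        substitute(literal, sigma), substitute(r, sigma), substitute(s, sigma)
    )
    if not replaced:
        raise ValueError("r does not occur in the literal after substitution")
    return ([new_literal]
            + [substitute(lit, sigma) for lit in rest_1]
            + [substitute(lit, sigma) for lit in rest_2])


# Example 2: C1 = P(g(f(x))) v Q(x),  C2 = (f(g(b)) = a) v R(g(c))
literal = ("P", ("g", ("f", "x")))
sigma = {"x": ("g", "b")}  # mgu of t = f(x) and r = f(g(b)), given by hand
result = paramodulate(literal, [("Q", "x")], ("f", ("g", "b")), "a",
                      [("R", ("g", "c"))], sigma)
print(result)
```

The returned list of literals corresponds to the clause $P(g(a)) \vee Q(g(b)) \vee R(g(c))$ obtained in the example.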
Definition 6. A λ-level rough paramodulation of $C_1$ and $C_2$ is one of the 
following binary λ-level rough paramodulations: 

(1) A binary λ-level rough paramodulation of $C_1$ and $C_2$; 

(2) A binary λ-level rough paramodulation of $C_1$ and a factor of $C_2$; 

(3) A binary λ-level rough paramodulation of a factor of $C_1$ and $C_2$; 

(4) A binary λ-level rough paramodulation of a factor of $C_1$ and a factor of $C_2$. 

Theorem Let $C_1$ and $C_2$ be two clauses, $\lambda > 0.5$, and let 
$P_{\lambda R}(C_1, C_2)$ be a binary λ-level rough paramodulation of $C_1$ and 
$C_2$; then we have 

$C_1 \wedge C_2 \Rightarrow P_{\lambda R}(C_1, C_2)$. 

This theorem shows that if the two given clauses are λ-level rough true, then 
so is the result $P_{\lambda R}(C_1, C_2)$ obtained by λ-level rough 
paramodulation of $C_1$ and $C_2$. 

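Proof: Without loss of generality, let $C_1$ be $\lambda_1 L[t] \vee C'_1$ with 
$u_{\lambda RI}(C_1) > \lambda$, and let $C_2$ be 
$\lambda_2 (r =_{\lambda R} s) \vee C'_2$ with $u_{\lambda RI}(C_2) > \lambda$, 
where $\lambda_2 > \lambda$. An mgu of $t$ and $r$ is $\sigma$, and 
$u_{\lambda RI}$ is an assignment function to the formulas in an interpretation 
of the rough logic. We need only prove $u_{\lambda RI}(C_1 \wedge C_2) > \lambda$ 
and $u_{\lambda RI}(P_{\lambda R}(C_1, C_2)) > \lambda$ respectively. 
$u_{\lambda RI}(C_1 \wedge C_2) > \lambda$ is obvious. Thus, we discuss only the 
following two cases: 

(1) $\lambda_1 > \lambda$. If $u_{\lambda RI}(L^{\sigma}[t^{\sigma}]) < 0.5$ 
(where $u_{\lambda RI}(\varphi) < 0.5$ means that $\varphi$ is false), then, 
because of the $=_{\lambda R}$ relations among $t^{\sigma}$, $r^{\sigma}$ and 
$s^{\sigma}$, also $u_{\lambda RI}(L^{\sigma}[s^{\sigma}]) < 0.5$ and 
$u_{\lambda RI}(C_1'^{\sigma}) > \lambda$. Hence 

$u_{\lambda RI}(P_{\lambda R}(C_1, C_2)) = u_{\lambda RI}(\lambda^* L^{\sigma}[s^{\sigma}] \vee C_1'^{\sigma} \vee C_2'^{\sigma}) > \lambda$. 

If $u_{\lambda RI}(L^{\sigma}[t^{\sigma}]) > 0.5$ (where 
$u_{\lambda RI}(\varphi) > 0.5$ means that $\varphi$ is true), then, because of 
the same $=_{\lambda R}$ relations, also 
$u_{\lambda RI}(L^{\sigma}[s^{\sigma}]) > 0.5$, and $\lambda^* > \lambda$ by the 
definition of $\lambda^*$ above. Hence, 

$u_{\lambda RI}(P_{\lambda R}(C_1, C_2)) = u_{\lambda RI}(\lambda^* L^{\sigma}[s^{\sigma}] \vee C_1'^{\sigma} \vee C_2'^{\sigma}) > \lambda$. 

(2) $\lambda_1 < 1 - \lambda$. Similarly, we also have 
$u_{\lambda RI}(P_{\lambda R}(C_1, C_2)) > \lambda$. Therefore, 

$C_1 \wedge C_2 \Rightarrow P_{\lambda R}(C_1, C_2)$. 

The proof is finished. 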

5 Conclusion 

The λ-level rough equality relation, when used in rough logic, grades an 
approximate concept; that is, it provides an approximation precision in the 
theory of approximation. By introducing the rough equality connective 
$=_{\lambda R}$ into rough logic, we obtain an axiomatic system with the 
λ-level equality $=_{\lambda R}$. To use it in rough reasoning, we introduced 
λ-level rough paramodulation and its inference methods. In fact, this is 
substitution by λ-level rough equality, which we call λ-level rough 
paramodulation reasoning. 

Further work will study strategies that combine λ-level rough paramodulation 
with rough resolution. This may generate new resolution strategies and 
reasoning methods that improve the speed and efficiency of resolution 
algorithms. 

Abstract. In this paper, we propose an annotated logic program called 
EVALPSN (Extended Vector Annotated Logic Program with Strong 
Negation) to formulate the semantics of the defeasible deontic reasoning 
proposed by D. Nute. We propose a translation from defeasible deontic 
theory into EVALPSN and show that the stable model of EVALPSN 
provides an annotated semantics for D. Nute's defeasible deontic logic. 
The annotated semantics can provide a theoretical base for an automated 
defeasible deontic reasoning system. 

Keywords: annotated logic, defeasible deontic logic, extended vector 
annotated logic program with strong negation, stable model 



1 Introduction 

Various kinds of non-monotonic reasoning are used in intelligent systems, and 
the treatment of inconsistency is becoming important. Moreover, it is 
indispensable to deal with more than two kinds of non-monotonic reasoning 
having different semantics in such intelligent systems. If we consider computer 
implementation of such complex intelligent systems, we need a theoretical 
framework for dealing with several kinds of non-monotonic reasoning and with 
inconsistency uniformly. We take annotated logic programming as such a 
theoretical framework. We have already provided annotated logic program based 
semantics for some kinds of non-monotonic reasoning, such as Reiter's default 
reasoning and Dressler's non-monotonic ATMS, in 
ALPSN (Annotated Logic Program with Strong Negation) [7, 8], and Billington's 
defeasible reasoning [10, 1] in VALPSN (Vector Annotated Logic Program with 
Strong Negation), which is a new version of ALPSN. 

In this paper, we provide a theoretical framework based on EVALPSN 
(Extended VALPSN) for dealing with D. Nute's defeasible deontic reasoning. 
We propose a translation from defeasible deontic theory into EVALPSN. By 
this translation, derivability in the defeasible deontic logic can be 
translated into satisfiability by EVALPSN stable models, and derivability can 
be computed by computing the corresponding EVALPSN stable models. 
Therefore the stable models of EVALPSN can be an annotated semantics for the 
defeasible deontic logic, and the annotated semantics can provide a theoretical 
foundation for an automated defeasible deontic reasoning system based on the 
translation and the computation of EVALPSN stable models. 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 470-478, 2001. 

© Springer-Verlag Berlin Heidelberg 2001 

Defeasible logics are well-known formalizations of defeasible reasoning; 
however, some of them do not have appropriate semantics. In [9], we have shown 
that Billington's defeasible theory [1] can be translated into VALPSN and that 
VALPSN can deal with defeasible reasoning. On the other hand, deontic logics 
are formalizations of normative reasoning and have been developed as a special 
case of modal logic. Moreover, they have also attracted attention as tools for 
modeling legal argument [12], and some applications to computer science have 
been realized [6, 5]. D. Nute introduced a defeasible deontic logic based on 
both defeasible logic and deontic logic, which can deal with both defeasible 
and normative reasoning [11]. 

2 VALPSN and EVALPSN 

We have formally defined VALPSN and its stable model semantics in [9]. After 
reviewing VALPSN briefly, we describe the extended parts of EVALPSN. 

Generally, a truth value called an annotation is explicitly attached to each 
atom of annotated logic. For example, let $p$ be an atom and $\mu$ an 
annotation; then $p : \mu$ is called an annotated atom. A partial order is 
defined on the set of annotations, which constitutes a complete lattice. An 
annotation in VALPSN is a 2-dimensional vector whose components are 
non-negative integers. We assume the complete lattice of vector annotations as 
follows: 

$\mathcal{T}_v = \{(x, y) \mid 0 \le x \le m,\ 0 \le y \le m,\ x, y \text{ and } m \text{ are integers}\}$. 

The ordering of $\mathcal{T}_v$ is denoted in the usual fashion by the symbol 
$\preceq_v$. Let $v_1 = (x_1, y_1)$ and $v_2 = (x_2, y_2)$. 

$v_1 \preceq_v v_2$ iff $x_1 \le x_2$ and $y_1 \le y_2$. 

In a vector annotated literal $p : (i, j)$, the first component $i$ of the 
vector annotation $(i, j)$ indicates the degree of positive information 
supporting the literal $p$, and the second component $j$ indicates the degree 
of negative information. For example, a vector annotated literal $p : (3, 2)$ 
can be informally interpreted as saying that $p$ is known to be true with 
strength 3 and false with strength 2. 

Annotated logics have two kinds of negation, an epistemic negation ($\neg$) and 
an ontological negation ($\sim$). The epistemic negation followed by an 
annotated atom is a mapping between annotations, and the ontological negation 
is a strong negation of the kind that appears in classical logics. The 
epistemic negation of vector annotated logic is defined as the following 
exchange between the components of vector annotations. 

$\neg(p : (i, j)) = p : \neg(i, j) = p : (j, i)$. 
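The lattice ordering on vector annotations and the epistemic negation are simple componentwise operations. A minimal sketch, with annotations as Python tuples and illustrative function names:

```python
def leq(v1, v2):
    """v1 precedes v2 in the lattice of vector annotations (componentwise <=)."""
    return v1[0] <= v2[0] and v1[1] <= v2[1]


def lub(*annotations):
    """Least upper bound: componentwise maximum, (0, 0) when no argument is given."""
    return (max((v[0] for v in annotations), default=0),
            max((v[1] for v in annotations), default=0))


def epistemic_negation(annotated_literal):
    """Swap positive and negative evidence: not (p:(i, j)) = p:(j, i)."""
    atom, (i, j) = annotated_literal
    return (atom, (j, i))


print(leq((1, 2), (3, 2)))
print(lub((3, 0), (0, 2)))
print(epistemic_negation(("p", (3, 2))))
```

Note that `(3, 0)` and `(0, 3)` are incomparable under `leq`, which is why the ordering is only partial.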

Therefore, the epistemic negation followed by annotated atomic formulas can be 
eliminated by the above syntactic operation. On the other hand, the epistemic 
negation followed by a non-atom is interpreted as strong negation [2]. 

Definition 1 (Strong Negation) Let $A$ be an arbitrary formula in VALPSN. 

$\sim A =_{def} A \to ((A \to A) \wedge \neg(A \to A))$. 

Definition 2 (VALPSN) If $L_0, \ldots, L_n$ are vector annotated literals, 

$L_1 \wedge \cdots \wedge L_i \wedge \sim L_{i+1} \wedge \cdots \wedge \sim L_n \to L_0$ 

is called a vector annotated logic program clause with strong negation (VALPSN 
clause). A VALPSN is a finite set of VALPSN clauses. 

We assume that all interpretations of a VALPSN $P$ have the Herbrand base 
$B_P$ (the set of all variable-free atoms) under consideration as their domain 
of interpretation. A Herbrand interpretation can be considered to be a mapping 
$I : B_P \to \mathcal{T}_v$. Usually, $I$ is denoted by the set 
$\{p : \sqcup v_i \mid I \models p : v_1 \wedge \cdots \wedge p : v_n\}$, where 
$\sqcup v_i$ is the least upper bound of $\{v_1, \ldots, v_n\}$. The ordering 
$\preceq_v$ is extended to interpretations in the natural way. Let $I_1$ and 
$I_2$ be any interpretations, and $A$ be an atom. 

$I_1 \preceq_I I_2 =_{def} (\forall A \in B_P)(I_1(A) \preceq_v I_2(A))$. 

In order to provide the stable model semantics for VALPSN, we define a mapping 
$T_P$ between Herbrand interpretations associated with every VALP (Vector 
Annotated Logic Program without strong negation) $P$. 

$T_P(I)(A) = \sqcup \{v \mid B_1 \wedge \cdots \wedge B_m \to A : v$ is a ground instance of 
a VALP clause in $P$ and $I \models B_1 \wedge \cdots \wedge B_m\}$, 

where the notation $\sqcup$ denotes the least upper bound. We define a special 
interpretation $\Delta$ which assigns the truth value $(0, 0)$ to all members 
of $B_P$. Then, the upward iteration $T_P \uparrow \lambda$ of the mapping $T_P$ 
is defined as: 

$T_P \uparrow 0 = \Delta$ 

$T_P \uparrow \lambda = \sqcup_{\alpha < \lambda} T_P(T_P \uparrow \alpha)$ for any ordinals $\alpha$, $\lambda$. 

Now we describe the Gelfond-Lifschitz transformation [3] for VALPSN. Let $I$ be 
any interpretation and $P$ a VALPSN. $P^I$, the Gelfond-Lifschitz 
transformation of the VALPSN $P$ with respect to $I$, is a VALP obtained from 
$P$ by deleting 

1) each clause that has a strongly negated vector annotated literal 
$\sim (C : v)$ in its body with $I \models (C : v)$, and 

2) all strongly negated vector annotated literals in the bodies of the 
remaining VALPSN clauses. 

Since $P^I$ contains no strong negation, $P^I$ has a unique least model that is 
given by $T_{P^I} \uparrow \omega$ [4]. 

Definition 3 (Stable Model of VALPSN) If $I$ is a Herbrand interpretation of a 
VALPSN $P$, $I$ is called a stable model of $P$ iff $I = T_{P^I} \uparrow \omega$. 
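For a finite ground program, the operator $T_P$, the Gelfond-Lifschitz transformation and the stable model test of Definition 3 can be sketched in a few functions. The clause encoding `(positive_body, negated_body, head)`, the function names and the three-clause program are assumptions made for illustration; they are not taken from the paper.

```python
def leq(v1, v2):
    return v1[0] <= v2[0] and v1[1] <= v2[1]


def lub(v1, v2):
    return (max(v1[0], v2[0]), max(v1[1], v2[1]))


def satisfies(interp, literal):
    """I |= p:v iff v precedes I(p); atoms missing from I carry (0, 0)."""
    atom, v = literal
    return leq(v, interp.get(atom, (0, 0)))


def least_model(valp):
    """Iterate T_P upward from the all-(0, 0) interpretation to a fixpoint."""
    interp = {}
    while True:
        new = {}
        for body, (head_atom, head_v) in valp:
            if all(satisfies(interp, lit) for lit in body):
                new[head_atom] = lub(new.get(head_atom, (0, 0)), head_v)
        if new == interp:
            return interp
        interp = new


def reduct(program, interp):
    """P^I: drop clauses with a strongly negated literal satisfied by I,
    then drop the strongly negated literals of the remaining clauses."""
    return [(pos, head) for pos, neg, head in program
            if not any(satisfies(interp, lit) for lit in neg)]


def normalise(interp):
    """Ignore atoms annotated (0, 0), which carry no information."""
    return {atom: v for atom, v in interp.items() if v != (0, 0)}


def is_stable(program, interp):
    return normalise(least_model(reduct(program, interp))) == normalise(interp)


# Hypothetical ground VALPSN:
#   p:(1,0) <-        q:(2,0) <- p:(1,0), ~r:(1,0)        r:(1,0) <- ~q:(1,0)
program = [
    ([], [], ("p", (1, 0))),
    ([("p", (1, 0))], [("r", (1, 0))], ("q", (2, 0))),
    ([], [("q", (1, 0))], ("r", (1, 0))),
]

print(is_stable(program, {"p": (1, 0), "q": (2, 0)}))
print(is_stable(program, {"p": (1, 0)}))
```

The candidate containing only `p` fails because its reduct keeps both remaining clauses and therefore derives `q` and `r` as well.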

The main difference between VALPSN and EVALPSN is the annotation and its 
lattice structure. An annotation of EVALPSN has the form $[(i, j), \mu]$ such 
that the first component $(i, j)$ is a 2-dimensional vector, the same as that 
of VALPSN, and the second one is a kind of index which expresses fact 
($\alpha$), obligation ($\beta$), and so on. An annotation of EVALPSN is called 
an extended vector annotation. We assume the complete lattice $\mathcal{T}$ of 
extended vector annotations as follows: 
$\mathcal{T} = \mathcal{T}_v \times \mathcal{T}_d$, where 

$\mathcal{T}_v = \{(x, y) \mid 0 \le x \le 3,\ 0 \le y \le 3\}$ and $\mathcal{T}_d = \{\bot, \alpha, \beta, \gamma, *_1, *_2, *_3, \top\}$, 

where $x$ and $y$ are integers. The ordering of $\mathcal{T}_d$ is denoted by 
the symbol $\preceq_d$ and described by the Hasse diagram (cube) in Fig. 1. The 
intuitive meanings of the members of $\mathcal{T}_d$ are: $\bot$ (unknown), 
$\alpha$ (fact), $\beta$ (obligation), $\gamma$ (not obligation), $*_1$ (both 
fact and obligation), $*_2$ (both obligation and not obligation), $*_3$ (both 
fact and not obligation) and $\top$ (inconsistent). The diagram shows that 
$\mathcal{T}_d$ is a trilattice in which the direction of the line 
$\gamma\beta$ indicates deontic truth, the direction of the line $\bot *_2$ 
indicates deontic knowledge, and the direction of the line $\bot\alpha$ 
indicates actuality. The ordering of $\mathcal{T}$ is denoted by the symbol 
$\preceq$ and defined as follows: let $[(i_1, j_1), \mu_1]$ and 
$[(i_2, j_2), \mu_2]$ be extended vector annotations, 

$[(i_1, j_1), \mu_1] \preceq [(i_2, j_2), \mu_2]$ iff $(i_1, j_1) \preceq_v (i_2, j_2)$ and $\mu_1 \preceq_d \mu_2$. 

We provide an intuitive interpretation for some members of $\mathcal{T}$. For 
example, an extended vector annotated literal $p : [(3, 0), \alpha]$ can be 
informally interpreted as "it is known that it is a fact that $p$ is true of 
strength 3", and $q : [(0, 2), \beta]$ can be informally interpreted as "it is 
known that it is obligatory that $q$ is false of strength 2". 

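There are two kinds of epistemic negation, $\neg_1$ and $\neg_2$, in the 
extended vector annotated logic, which are regarded as mappings over 
$\mathcal{T}_v$ and $\mathcal{T}_d$, respectively. 

Definition 4 (Epistemic Negation of EVALPSN) 

$\neg_2([(i, j), \bot]) = [(i, j), \bot]$, $\neg_2([(i, j), \alpha]) = [(i, j), \alpha]$, 

$\neg_2([(i, j), \beta]) = [(i, j), \gamma]$, $\neg_2([(i, j), \gamma]) = [(i, j), \beta]$, 

$\neg_2([(i, j), *_1]) = [(i, j), *_3]$, $\neg_2([(i, j), *_2]) = [(i, j), *_2]$, 

$\neg_2([(i, j), *_3]) = [(i, j), *_1]$, $\neg_2([(i, j), \top]) = [(i, j), \top]$. 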




Fig. 1. Hasse diagram of the lattice $\mathcal{T}_d$ (figure not reproduced) 

We can syntactically eliminate the epistemic negation followed by an annotated 
atom based on the definition. We can also define the strong (ontological) 
negation ($\sim$) in EVALPSN by means of the epistemic negations, in the same 
way as in Definition 1. The formal interpretations of the epistemic negations 
and the strong negation can be defined as in the case of ALPSN [7, 2]. 

Definition 5 (EVALPSN) If $L_0, \ldots, L_n$ are extended vector annotated literals, 

$L_1 \wedge \cdots \wedge L_i \wedge \sim L_{i+1} \wedge \cdots \wedge \sim L_n \to L_0$ 

is called an extended vector annotated logic program clause with strong 
negation (EVALPSN clause). An EVALPSN is a finite set of EVALPSN clauses. 

3 Defeasible Deontic Logic 

We introduce D. Nute's defeasible deontic logic [11]. A literal is any atomic
formula or its negation. If $\varphi$ is a literal, then $\bigcirc\varphi$ and $\sim\bigcirc\varphi$ are deontic
formulas. All and only literals and deontic formulas are formulas. If $\varphi$ is a literal,
$\varphi$ and $\sim\varphi$ are called the complements of each other, and $\bigcirc\varphi$ and $\sim\bigcirc\varphi$ are
called the complements of each other. $\neg\varphi$ denotes the complement of any formula
$\varphi$, positive or negative. Rules are expressions distinct from formulas. Rules are
constructed by using three primitive symbols: $\rightarrow$, $\Rightarrow$ and $\leadsto$. If $A \cup \{\varphi\}$ is a set
of formulas, $A \rightarrow \varphi$ is a strict rule, $A \Rightarrow \varphi$ is a defeasible rule, and $A \leadsto \varphi$ is an
undercutting defeater. In each case, we call A the antecedent of the rule and $\varphi$
the consequent of the rule. If $A = \{\psi\}$, we denote $A \rightarrow \varphi$ as $\psi \rightarrow \varphi$, and similarly
for defeasible rules and defeaters. Antecedents for strict rules and defeaters must
be non-empty, and antecedents for defeasible rules may be empty. We call a rule
of the form $\emptyset \Rightarrow \varphi$ a presumption and represent it more simply as $\Rightarrow \varphi$. All rules are
read as 'if-then' statements. We read $A \Rightarrow \varphi$ as “If A, then evidently (normally,
typically, presumably) $\varphi$”, we read $\Rightarrow \varphi$ as “Presumably, $\varphi$”, and we read $A \leadsto \varphi$
as “If A, it might be that $\varphi$”. The role of a defeater is only to interfere with the
process of drawing an inference from a defeasible rule. Defeaters never support
inferences directly, although they may support inferences indirectly by undercutting
potential defeaters.

Definition 6 (Defeasible Theory) A defeasible theory is a quadruple $(F, R, C, \prec)$
such that F is a set of formulas (intuitively F can be regarded as a set of facts),
R is a set of rules, C is a set of finite sets of formulas such that for every formula
$\varphi$, either $\{\varphi\} \in C$ or $\{\neg\varphi\} \in C$, or $\{\varphi, \neg\varphi\} \in C$, and $\prec$ is an acyclic binary
relation on the non-strict rules in R. The members of C are called conflict sets.

The ultimate purpose of the set of conflict sets is to determine sets of competing
rules. We introduce some notation. There are four kinds of defeasible
consequence relation: strict derivability ($\vdash$), strict refutability ($\dashv$), defeasible
derivability ($|\!\sim$), and defeasible refutability ($\sim\!|$). In [11], Nute first presented
a defeasible logic which includes deontic operators in its language; then, in order
to provide a defeasible deontic logic, he extended the proof theory of the original
defeasible logic with further inference conditions such as deontic inheritance and
deontic detachment. We introduce only some of the inference conditions.
Definition 7 (DSD-proof) A DSD-proof in a defeasible theory $(F, R, C, \prec)$ is a
sequence $\sigma$ of defeasible assertions such that for each $k \le l(\sigma)$, one of the inference
conditions [M+], [DM+], [DM−], [E+], [SS+], [DSS+], [DSD+], [DDD+]
and [DSD−] holds. A defeasible theory having a DSD-proof is called a defeasible
deontic theory.

[M+] $\sigma_k = T \vdash \varphi$ and either $\varphi \in F$ or there is $A \rightarrow \varphi \in R$ such that $T \vdash A$
succeeds at $\sigma|k$.

[DM+] $\sigma_k = T \vdash \bigcirc\varphi$ and there is $A \rightarrow \varphi \in R$ or $A \rightarrow \bigcirc\varphi \in R$ such that
$T \vdash \bigcirc A$ succeeds at $\sigma|k$.

[DM“] ak=TE(P, 

2. for each A ^ cp E Rj T ^ A succeeds at and 

3. if (^ = 0^7 then for each A^p;EA or A^ E Rj T E QA succeeds 
at ak‘ 

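[E+] $\sigma_k = T \mathrel{|\!\sim} \varphi$ and $T \vdash \varphi$ succeeds at $\sigma|k$.

[SS+] $\sigma_k = T \mathrel{|\!\sim} \varphi$ and there is $A \rightarrow \varphi \in R$ such that

1. $T \mathrel{|\!\sim} A$ succeeds at $\sigma|k$,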

2. for each cp Xt there is pj E such that F H ^ succeeds at ar , and 

3. for each (p'x.r and C d-covering in F such that every rule in Cr is 

strict, either 

(a) there is a literal pj E and B ^ ip E Cr such that F ^ F succeeds 
at <j /j , 

(b) there is ^ ^ad B ^ ip E Cr such that F ^ QB succeeds 
at ak, or 

(c) there is Qip E (7^ and B — > Qip E Cr such that both T ^ B and 
F ^ QB succeed at at. 

[DSS+], [DSD+], [DDD+] and [DSD-] can be found in [11]. 

4 From Defeasible Deontic Theory into EVALPSN 

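We assume the following correspondence as a basis of a translation from a
defeasible deontic theory into an EVALPSN.

[Assumption] Let $T = (F, R, C, \prec)$ be a defeasible deontic theory, $I$ be the
stable model of an EVALPSN which is the translation of $T$, and $\varphi$ be a literal.

$T \vdash \varphi$                       iff  $I \models \varphi:[(3,0),\alpha]$,
$T \dashv \varphi$                       iff  $I \not\models \varphi:[(3,0),\alpha]$,
$T \vdash \bigcirc\varphi$               iff  $I \models \varphi:[(3,0),\beta]$,
$T \dashv \bigcirc\varphi$               iff  $I \not\models \varphi:[(3,0),\beta]$,
$T \vdash \sim\bigcirc\varphi$           iff  $I \models \varphi:[(3,0),\gamma]$,
$T \dashv \sim\bigcirc\varphi$           iff  $I \not\models \varphi:[(3,0),\gamma]$,
$T \mathrel{|\!\sim} \varphi$            iff  $I \models \varphi:[(2,0),\alpha]$,
$T \mathrel{\sim\!|} \varphi$            iff  $I \not\models \varphi:[(2,0),\alpha]$,
$T \mathrel{|\!\sim} \bigcirc\varphi$    iff  $I \models \varphi:[(2,0),\beta]$,
$T \mathrel{\sim\!|} \bigcirc\varphi$    iff  $I \not\models \varphi:[(2,0),\beta]$,
$T \mathrel{|\!\sim} \sim\bigcirc\varphi$ iff  $I \models \varphi:[(2,0),\gamma]$,
$T \mathrel{\sim\!|} \sim\bigcirc\varphi$ iff  $I \not\models \varphi:[(2,0),\gamma]$.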

The negation $\sim$ followed by a literal $\varphi$ is translated into the
epistemic negation $\neg_1$ in EVALPSN, and the negation $\sim$ followed by a deontic
operator $\bigcirc$ is translated into the epistemic negation $\neg_2$. The intuitive
meanings of some translated EVALPSN clauses are as follows:

$\varphi:[(3,0),\alpha]$    $\varphi$ is strictly derivable as a fact,
$\varphi:[(2,0),\beta]$    $\bigcirc\varphi$ is defeasibly derivable,
$\varphi:[(1,0),\gamma]$    $\sim\bigcirc\varphi$ is not defeasibly derivable,
$\varphi:[(0,0),\alpha]$    $\varphi$ is unknown as a fact.

$I \models \varphi:[(1,0),\gamma]$ expresses that the antecedent of a rule in which the consequent
is $\sim\bigcirc\varphi$ is derivable but the consequent is not derivable because the rule is
defeated by other conflicting rules. The vector annotation (1,0) indicates not
only the refutation of $\varphi$ but also defeated evidence in defeasible reasoning, and
we will utilize such information for inductive defeasible deontic reasoning in the
future.

Based on the assumption, the translation rule is given for facts, obligations,
and four kinds of rules, $A \rightarrow \varphi$, $A \rightarrow \bigcirc\varphi$, $A \Rightarrow \varphi$ and $A \Rightarrow \bigcirc\varphi$. Since defeaters
$A \leadsto \varphi$ or $A \leadsto \bigcirc\varphi$ cannot be used to derive the consequents $\varphi$ or $\bigcirc\varphi$, they do
not need to be translated. We describe the translation only for facts and strict
rules.

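[Translation Rule] Let $T = (F, R, C, \prec)$ be a defeasible deontic theory, and $\varphi$
and $\psi$ be literals.

[1] for $\varphi, \bigcirc\varphi, \sim\bigcirc\varphi \in F$, they are translated into

$\varphi:[(3,0),\alpha]$, $\varphi:[(3,0),\beta]$ and $\varphi:[(3,0),\gamma]$, respectively.

Let $A = \{a_1, \bigcirc a_2\}$ and $B = \{b_1, \bigcirc b_2\}$ for simplicity.

[2] for $r = A \rightarrow \varphi \in R$. From [M+]. Suppose $\exists A \rightarrow \varphi \in R$. Since $T \vdash A$ implies
$T \vdash \varphi$, the rule r is translated into

$a_1:[(3,0),\alpha] \wedge a_2:[(3,0),\beta] \rightarrow \varphi:[(3,0),\alpha]$.

From [DM+]. Suppose $\exists A \rightarrow \varphi \in R$. Since $T \vdash \bigcirc A$ implies $T \vdash \bigcirc\varphi$, the rule r
is translated into

$a_1:[(3,0),\beta] \wedge a_2:[(3,0),\beta] \rightarrow \varphi:[(3,0),\beta]$.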
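The [M+] and [DM+] cases of the translation rule can be sketched as a small string-producing function. This is a minimal illustration, not the authors' implementation: the class `Formula`, the helper `annotate`, the ASCII spellings `alpha`/`beta`, and the separators `&` and `->` are all hypothetical choices made for the example, and negations are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Formula:
    """A literal, optionally under the obligation operator (hypothetical type)."""
    atom: str
    obligatory: bool = False


def annotate(f, strength=3, force_obligation=False):
    """Render a formula as an annotated literal such as 'a1:[(3,0),alpha]'."""
    mu = "beta" if (f.obligatory or force_obligation) else "alpha"
    return f"{f.atom}:[({strength},0),{mu}]"


def translate_strict(antecedent, consequent):
    """Return the two clauses obtained from a strict rule A -> phi.

    The first clause follows the [M+] case (facts stay facts); the second
    follows the [DM+] case (every literal is read under obligation).
    """
    body_m = " & ".join(annotate(a) for a in antecedent)
    body_dm = " & ".join(annotate(a, force_obligation=True) for a in antecedent)
    return [
        f"{body_m} -> {annotate(consequent)}",
        f"{body_dm} -> {annotate(consequent, force_obligation=True)}",
    ]


# A = {a1, O a2} and consequent phi, as in the text.
clauses = translate_strict([Formula("a1"), Formula("a2", True)], Formula("phi"))
```

With the antecedent used in the text, the first clause annotates `a1` with `alpha` and `a2` with `beta`, and the second annotates everything with `beta`, mirroring the two clauses given above.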

From [SS^q]. Suppose 3^p E (7^ and 3B p) E Cr. Prom 3. (a), since F A, 
F H ^ and T ^ B imply F the rule r is translated into 

ai:[(2,0),a] Aa 2 :[( 2 , 0 ),/?]A-,^:[( 3 , 0 ),a]A 61 : [(2, 0), a] ^ [(2, 0), a] 

ai:[(2,0),a] Aa 2 :[( 2 , 0 ),/?]A-,^:[( 3 , 0 ),a]A 62 : [(2, 0), ^ [(2, 0), a]. 

Suppose 3 Q} p) E and 3B p) E Cr. Prom 3.(b), since F A, F H ^ and 
F ^ QB imply F the rule r is translated into 



ai:[(2,0),a] Aa 2 :[( 2 , 0 ),/?]A^,^:[( 3 , 0 ),a]A 61 : [(2,0), ^ [(2,0), a] 

ai:[(2,0),a] Aa 2 :[( 2 , 0 ),/?]A^,^:[( 3 , 0 ),a]A 62 : [(2,0), ^ [(2,0), a]. 
Suppose 3 Q E and 3B — > E Cr. From 3.(b), since T T 3 ^p^

T ^ B and T ^ QB imply T (p, the rule r is translated into 

ai:[( 2 , 0 ),a] Aa 2 :[( 2 , 0 ),/?]A-,^:[( 3 , 0 ),a]A-, 6 i:[( 2 , 0 ),a]A-,^^^ 

<l>‘ [(2, 0), a] 

ai : [(2, 0), a] A U 2 : [(2, 0), /?] A ^ : [(3, 0), a] A 6 i : [(2, 0), a] A ^ 

<P' [( 2 , 0 ), a] 

ai:[( 2 , 0 ),a] Aa 2 :[( 2 , 0 ),/?]A-,^:[( 3 , 0 ),a]A-, 6 i:[( 2 , 0 ),/?]A-, 62 :^ 

<l>‘ [(2, 0), a] 

ai : [(2, 0), a] A a 2 : [(2, 0), /?] A ^ : [(3, 0), a] A 62 : [(2, 0),/3]^p>: [(2, 0), a] . 

We take an example of a defeasible deontic theory from [11], called the Chisholm
Example. Then we translate it into an EVALPSN and compute the stable model
of the EVALPSN.

Example ( Chisholm Example ) We consider the following situation. 

1. Jones ought to visit his mother. 

2. If Jones visits his mother, he ought to call her and tell her he is coming. 

3. It ought to be that if Jones does not visit his mother, he does not call her 
and tell her he is coming. 

4. In fact, Jones does not visit his mother. 

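If we formalize 1-4 as a defeasible deontic theory, writing v for “Jones visits his
mother” and c for “Jones calls her”, it includes:

1. $\bigcirc v$    2. $v \Rightarrow \bigcirc c$    3. $\sim v \Rightarrow \bigcirc\sim c$    4. $\sim v$

The above formulas 1-4 are translated into EVALPSN clauses according to the
translation rules. $\bigcirc v$ and $\sim v$ are translated into $v:[(3,0),\beta]$ and $v:[(0,3),\alpha]$,
respectively. The defeasible rule $v \Rightarrow \bigcirc c$ is translated into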

a : [(2, 0), a] A c: [(0, 3), /?] A a : [(0, 2), a] ^ c: [(2, 0), /?] 

a : [(2, 0), /?] A c: [(0, 3), /?] A a : [(0, 2), a] ^ c: [(2, 0), /?] 

: [(2, 0), /?] A c: [(0, 3), a] A : [(0, 2), /3] ^ c: [(2, 0), p] 

V : [(2, 0), a] A c: [(0, 3), p]Av: [(0, 2), a] ^ c: [(1, 0), p] 

V : [(2, 0), /?] A c: [(0, 3), 0]Av: [(0, 2), a] ^ c: [(1, 0), 0] 

V : [(2, 0), /?] A c: [(0, 3), a] A : [(0, 2), a] ^ c: [(1, 0), p] 

V : [(2, 0), p]A c: [(0, 3), a] A : [(0, 2), ^ c: [(1, 0), p] 

The defeasible rule $\sim v \Rightarrow \bigcirc\sim c$ is also translated into

V : [(0, 2), a] A c: [(3, 0), fJ]A a : [(2, 0), a] ^ c: [(0, 2), fj] 

V : [(0, 2), /?] A c: [(3, 0), /?] A a : [(2, 0), a] ^ c: [(0, 2), /3] 

a : [(0, 2), p]A c: [(3, 0), a] A a : [(2, 0), p]^c: [(0, 2), p] 

V : [(0, 2), a] A c: [(3, 0), 0]Av: [(2, 0), a] ^ c: [(0, 1), 0] 

V : [(0, 2), /?] A c: [(3, 0), /3]Av: [(2, 0), a] ^ c: [(0, 1), /3] 

V : [(0, 2), /?] A c: [(3, 0), a] A a : [(2, 0), a] ^ c: [(0, 1), p] 

V : [(0, 2), p]A c: [(3, 0), a] A a : [(2, 0), p] ^ c: [(0, 1), p] 
Then the EVALPSN has only one stable model

$\{\, v:[(3,0),\beta],\ v:[(0,3),\alpha],\ c:[(0,2),\beta],\ c:[(1,0),\beta] \,\}$,

which says that $T \vdash \sim v$, $T \mathrel{|\!\sim} \bigcirc\sim c$ and $T \mathrel{\sim\!|} \bigcirc c$.

5 Conclusion 

In this paper, we have provided an annotated semantics for Nute's defeasible
deontic logic. It is one of the theoretical frameworks for intelligent reasoning
systems based on annotated logic programming. In particular, it provides a
theoretical framework for automated reasoning systems for defeasible deontic
logic. Since the inference conditions of defeasible deontic logic are complicated,
an automated reasoning system for it is useful and convenient. We have in fact
implemented an EVALPSN-based automated reasoning system for defeasible
deontic logic.

References 

1. Billington, D.: Conflicting Literals and Defeasible Logic. Proc. 2nd Australian
Workshop on Commonsense Reasoning (1997) 1-15

2. Da Costa, N.C.A., Subrahmanian, V.S., and Vago, C.: The Paraconsistent Logics PT.
Zeitschrift für MLGM 37 (1989) 139-148

3. Gelfond, M. and Lifschitz, V.: The Stable Model Semantics for Logic Programming.
Proc. 5th Int'l Conf. and Symp. on Logic Programming (1989) 1070-1080

4. Lloyd, J.W.: Foundations of Logic Programming (2nd edition). Springer (1987)

5. Meyer, C.J. and Wieringa, J.R.: Deontic Logic in Computer Science. John Wiley &
Sons (1993)

6. McNamara, P. and Prakken, H. (eds.): Norms, Logics and Information Systems. New
Studies in Deontic Logic and Computer Science, Frontiers in Artificial Intelligence
and Applications Vol. 49, IOS Press (1999)

7. Nakamatsu, K. and Suzuki, A.: Annotated Semantics for Default Reasoning. Proc.
3rd Pacific Rim International Conference on Artificial Intelligence (1994) 180-186

8. Nakamatsu, K. and Suzuki, A.: A Nonmonotonic ATMS Based on Annotated Logic
Programs. Agents and Multi-Agents Systems, LNAI 1441, Springer (1998) 79-93

9. Nakamatsu, K. and Abe, J.M.: Reasonings Based on Vector Annotated Logic
Programs. Computational Intelligence for Modelling, Control & Automation, Concurrent
Systems Engineering Series 55, IOS Press (1999) 396-403

10. Nute, D.: Basic Defeasible Logics. Intensional Logics for Programming. Oxford
University Press (1992) 125-154

11. Nute, D.: Apparent Obligation. Defeasible Deontic Logic. Kluwer Academic
Publisher (1997) 287-316

12. Prakken, H.: Logical Tools for Modelling Legal Argument. A Study of Defeasible
Reasoning in Law. Law and Philosophy Library Vol. 32, Kluwer Academic (1997)




Spatial Reasoning via Rough Sets 



Lech Polkowski 

Polish-Japanese Institute of Information Technology
Koszykowa 86, 02-008 Warsaw, Poland
and

Department of Mathematics and Information Sciences
Warsaw University of Technology
Pl. Politechniki 1, 00-650 Warsaw, Poland
e-mail: polkow@pjwstk.waw.pl



Abstract. Rough set reasoning may be based on the notion of a part 
to a degree as proposed in rough mereology. Mereological theories form 
also a foundation for spatial reasoning. Here we show how to base spatial 
reasoning on rough-set notions. 

Keywords: rough sets, mereology, rough mereology, connection, spatial 
reasoning 



1 Introduction 

For expressing relations among entities, computer science has two basic languages:
the language of set theory, based on the opposition element-set, where entities
are considered as consisting of distinct points, and languages of mereology,
based on the opposition part-whole, for discussing entities continuous in their
nature.^1 Mereological theories active nowadays go back to ideas of S. Leśniewski
and A. N. Whitehead; the mereological theory of Leśniewski is based on the notion
of a part and the notion of a (collective) class, cf. [5]. Mereological ideas of
Whitehead were formulated as the Calculus of Individuals and were expressed in terms
of connection, cf. [1].^2 In [13] a new paradigm for approximate reasoning, rough
mereology, has been introduced. Rough mereology is based on the notion of a
part to a degree and thus falls in the province of part-based mereologies.^3 In this
paper we introduce rough mereology in the ontological universe of Leśniewski
[5] (cf. [8], [15], [16]). We define a Čech quasi-topology [6] and we apply it in a
study of connections.^4 We introduce some notions of connection, viz. the limit
connection $C_T$ and graded connections $C_\alpha$, and we study their properties. In

^1 For instance, Spatial Reasoning relies extensively on mereological theories of part,
cf. [11].

^2 Mereology based on connection gave rise to spatial calculi based on topological
notions derived therefrom (mereotopology) [3], [11].

^3 Rough mereology inherits the general idea of approximations by means of
membership functions from rough set theory [12].

^4 A quasi-topology was introduced in the connection model of mereology [1] under
additional assumptions of regularity.
particular, we demonstrate that they induce the same notion of an element as
the original mereology. We also discuss the case of distributed reasoning systems
in which connections in potentially infinite information systems may be studied
and external connections may be effected in a simple way. Our results show that
rough mereology offers an inference mechanism based on connection applicable
e.g. to spatial reasoning.^5 The reader will find all notions and bibliography in
[14].

2 Preliminaries

We introduce consecutively basic notions of ontology, mereology and rough
mereology.



2.1 Ontology 

We adopt Ontology introduced by Leśniewski [5], based on the primitive notion
of a copula “is”, denoted $\varepsilon$, whose meaning is expressed by

The Ontology Axiom $X\varepsilon Y \Leftrightarrow \exists Z.Z\varepsilon X \wedge \forall U,W.(U\varepsilon X \wedge W\varepsilon X \Rightarrow U\varepsilon W) \wedge
\forall Z.(Z\varepsilon X \Rightarrow Z\varepsilon Y)$.^6

We have thus a mechanism to discuss and understand statements like X is
Y without resorting to intuition. The reader will observe that X is Y (formally
written down as $X\varepsilon Y$) is analogous to the formula $X \subseteq Y$ if we know that X is
a singleton, say $X = \{a\}$ for some a. In the ontological setting we do not need to
know this: it is encoded in the ontological axiom.



2.2 Mereology 

Mereology is a theory of collective classes, i.e. individual objects representing
distributive classes (names). We adopt the notion of a part as the primitive
notion of mereology, cf. [5], [10]. We assume that the ontological copula $\varepsilon$ is
given and that the Ontology Axiom holds. A predicate pt of part satisfies the
following conditions.

(ML1) $X\varepsilon pt(Y) \Rightarrow X\varepsilon X \wedge Y\varepsilon Y$ (X, Y are individual entities).

(ML2) $X\varepsilon pt(Y) \wedge Y\varepsilon pt(W) \Rightarrow X\varepsilon pt(W)$ (pt is transitive).

(ML3) $\forall X.(\neg(X\varepsilon pt(X)))$ (pt is non-reflexive).

On the basis of the notion of a part, we define the notion of an element as a
predicate el (originally called an ingredient).

^5 Rough set ideas have already been applied explicitly in Spatial Reasoning, e.g. in
the fields of multi-resolution data management, epistemology of rough location, cf. [14],
and in the “egg-yolk” representation of vague regions [4].

^6 The meaning of $X\varepsilon Y$ can be made clear now: shortly, X is an individual (i.e. any
$U, W\varepsilon X$ are such that $U\varepsilon W$ (and vice versa)) and this individual is Y. In particular,
$X\varepsilon X$ holds iff X is an individual object.
Definition 2.1 $X\varepsilon el(Y) \Leftrightarrow X\varepsilon pt(Y) \vee X = Y$.^7

A fundamentally important component of Mereology is the class functor Kl,
intended to make distributive classes (names) into collective classes
(individuals).^8

Definition 2.2

$X\varepsilon Kl(Y) \Leftrightarrow \exists Z.Z\varepsilon Y \wedge \forall Z.(Z\varepsilon Y \Rightarrow Z\varepsilon el(X)) \wedge \forall Z.(Z\varepsilon el(X) \Rightarrow
\exists U,W.(U\varepsilon Y \wedge W\varepsilon el(U) \wedge W\varepsilon el(Z)))$.^9

We now recall mereology based on the notion of a Connection. 



2.3 Connection 

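This approach, cf. [1], is based on the functor C of being connected.^10

Definition 2.3

(C1) $X\varepsilon C(Y) \Rightarrow X\varepsilon X \wedge Y\varepsilon Y$ (X, Y are individuals).

(C2) $X\varepsilon C(X)$ (reflexivity).

(C3) $X\varepsilon C(Y) \Rightarrow Y\varepsilon C(X)$ (symmetry).

(C4) $[\forall Z.(Z\varepsilon C(X) \Leftrightarrow Z\varepsilon C(Y))] \Rightarrow (X = Y)$ (extensionality).^11

From C, other functors are derived; we recall the functors basic for topological
issues, (TP) and (NTP).

Definition 2.4

(O) $X\varepsilon O(Y) \Leftrightarrow \exists Z.(Z\varepsilon el_C(X) \wedge Z\varepsilon el_C(Y))$ (X and Y overlap).

(EC) $X\varepsilon EC(Y) \Leftrightarrow X\varepsilon C(Y) \wedge \neg(X\varepsilon O(Y))$ (X is externally connected to Y).

(TP) $X\varepsilon TP(Y) \Leftrightarrow X\varepsilon pt_C(Y) \wedge \exists Z.(Z\varepsilon EC(X) \wedge Z\varepsilon EC(Y))$ (X is a tangential
part of Y).

(NTP) $X\varepsilon NTP(Y) \Leftrightarrow X\varepsilon pt_C(Y) \wedge \neg(X\varepsilon TP(Y))$ (X is a non-tangential part
of Y).^12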



2.4 Rough Mereology 

Rough mereology is based on the predicate of being a part to a degree, rendered
here as a family $\mu_r$ called a rough inclusion, where $r \in [0,1]$. The formula
$X\varepsilon\mu_r(Y)$ reads “X is a part of Y to degree at least r”. We assume an ontology
of $\varepsilon$ and a mereology inducing a functor el. Basic postulates of rough mereology
are as follows.

^7 We recall that X, Y are external to each other, in symbols $X\varepsilon ext(Y)$, when there is
no Z with $Z\varepsilon el(X)$ and $Z\varepsilon el(Y)$.

^8 Mereology we adopt here is thus the classical (maximal) mereology, cf. [11].

^9 Thus, Kl(Y) is an individual containing as elements all individuals in Y and such
that each of its elements has an element in common with an individual in Y; notice
obvious analogies to the union of a family of sets in set theory.

^10 For the sake of uniformity of exposition, we will formulate all essentials of this
theory in the ontological language applied above.

^11 Notice that some schemes dispense with extensionality, cf. [3], [11].

^12 Connection allows for topological notions, cf. [1], [11].







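(RM0) $\exists r(X\varepsilon\mu_r(Y)) \Rightarrow X\varepsilon X \wedge Y\varepsilon Y$ (X and Y are individuals).

(RM1) $X\varepsilon\mu_1(Y) \Leftrightarrow X\varepsilon el(Y)$ (being a part to degree 1 is equivalent to being
an element).

(RM2) $X\varepsilon\mu_1(Y) \Rightarrow \forall Z.(Z\varepsilon\mu_r(X) \Rightarrow Z\varepsilon\mu_r(Y))$ (monotonicity in the object
position).

(RM3) $X = Y \wedge X\varepsilon\mu_r(Z) \Rightarrow Y\varepsilon\mu_r(Z)$ (the identity is a $\mu$-congruence).

(RM4) $X\varepsilon\mu_r(Y) \wedge s \le r \Rightarrow X\varepsilon\mu_s(Y)$ (the meaning: a part to a degree at least r).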


3 Mereotopology 

We now are concerned with topological structures arising in a mereological
universe endowed with a rough inclusion $\mu_r$. For a different approach, where
connection may be derived from the axiomatized notion of a boundary, see [16].
We show that in this framework one defines Čech quasi-topologies.

3.1 Mereotopology: Čech topologies

Here, we induce a Čech quasi-topology in any rough mereological universe.

Definition 3.1 (i) We introduce the name $M_rX$ for the property expressed by
$\mu_r$ with respect to X, i.e. $Z\varepsilon M_rX \Leftrightarrow Z\varepsilon\mu_r(X)$; (ii) We let $Z\varepsilon Kl_r(X) \Leftrightarrow
Z\varepsilon Kl(M_rX)$.

Thus $Kl_r(X)$ is the class of objects having the property $\mu_r(X)$. We define a
functor int.

Definition 3.2 We define a name I(X) by letting

$Z\varepsilon I(X) \Leftrightarrow \exists s < 1.(Kl_s(Z)\varepsilon el(X))$

and we let $int(X) = Kl(I(X))$.

Then we have the following properties of int, cf. [14].

Proposition 1. (i) $int(X)\varepsilon el(X)$; (ii) $X\varepsilon el(Y) \Rightarrow int(X)\varepsilon el(int(Y))$.

Properties (i)-(ii) witness that int introduces a Čech quasi-topology. We
denote it by the symbol

It follows that $\mu_1$ coincides with the given el, establishing a link between rough
mereology and mereology, while predicates $\mu_r$ with $r < 1$ diffuse el to a hierarchy
of a part in various degrees. The reader may use as an archetypical rough inclusion
the Łukasiewicz measure $\mu(X,Y) = \frac{card(X \cap Y)}{card(X)}$, X, Y being non-empty finite sets in
a universe U.

As recalled in Section 2.3, topological structures may be defined within the
connection framework via the notion of a non-tangential proper part. The predicate of
connection allows also for some calculi of topological character based directly on
regions, e.g. the RCC calculus [2].

A quasi-topology is a topology without the null element (the empty set).

Recall that a Čech topology [6] is a closure structure in which the closure operator
cl satisfies the following: (i) $cl\,\emptyset = \emptyset$, (ii) $X \subseteq cl\,X$, (iii) $X \subseteq Y \Rightarrow cl\,X \subseteq cl\,Y$, so the
associated (by duality $int\,X = U - cl(U - X)$) Čech interior operator int should only
satisfy the following: $int\,\emptyset = \emptyset$; $int\,X \subseteq X$; $X \subseteq Y \Rightarrow int\,X \subseteq int\,Y$.
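As a small illustration of a rough inclusion on finite sets, the sketch below computes the inclusion degree as the fraction of X that also lies in Y. It assumes the measure has the form card(X ∩ Y)/card(X); the function names `inclusion_degree` and `is_part_to_degree` are hypothetical, and exact rational arithmetic is used only to avoid rounding.

```python
from fractions import Fraction


def inclusion_degree(x, y):
    """Fraction of the finite non-empty set x that is contained in y."""
    x, y = frozenset(x), frozenset(y)
    if not x:
        raise ValueError("x must be non-empty")
    return Fraction(len(x & y), len(x))


def is_part_to_degree(x, y, r):
    """True when x is a part of y to degree at least r."""
    return inclusion_degree(x, y) >= r


half = inclusion_degree({1, 2, 3, 4}, {3, 4, 5})  # two of four elements
```

In this sketch, degree 1 is reached exactly when X is a subset of Y, which matches the reading of (RM1) with el taken as set inclusion, and lowering the threshold r never turns a true statement false, in line with (RM4).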
4 Connections from Rough Inclusions

In this section we investigate some methods for inducing connections from rough
inclusions.



4.1 Limit Connection

We define a functor $C_T$ as follows.

Definition 4.1 $X\varepsilon C_T(Y) \Leftrightarrow \neg(\exists r,s < 1.ext(Kl_r(X), Kl_s(Y)))$. Clearly,
(C1)-(C3) hold with $C_T$ irrespective of $\mu$. For (C4), consult [14].

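4.2 From Graded Connections to Connections

We begin with a definition of an individual $Bd_rX$.

Definition 4.2 $Bd_rX = Kl(\mu_r^+(X))$, where $Z\varepsilon\mu_r^+(X) \Leftrightarrow Z\varepsilon\mu_r(X) \wedge \neg(\exists s >
r.Z\varepsilon\mu_s(X))$.

We introduce a graded (r, s)-connection C(r, s) ($r, s \le 1$) via

Definition 4.3 $X\varepsilon C(r,s)(Y) \Leftrightarrow \exists W.W\varepsilon el(Bd_rX) \wedge W\varepsilon el(Bd_sY)$.

We have then

Proposition 2. (i) $X\varepsilon C(1,1)(X)$; (ii) $X\varepsilon C(r,s)(Y) \Rightarrow Y\varepsilon C(s,r)(X)$.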

Concerning the property (C4), we adopt here a new approach. It is valid from
a theoretical point of view to assume that we may have “infinitesimal” parts, i.e.
objects as “small” as desired.

Infinitesimal parts model We adopt a new axiom of infinitesimal parts

(IP) $\neg(X\varepsilon el(Y)) \Rightarrow \forall r > 0\,\exists Z\varepsilon el(X), s < r.Z\varepsilon\mu_s^+(Y)$.

Our rendering of the property (C4) under (IP) is as follows:

Proposition 3. $\neg(X\varepsilon el(Y)) \Rightarrow \forall r > 0.\exists Z, s > r.Z\varepsilon C(1,1)(X) \wedge Z\varepsilon C(1,s)(Y)$.



Connections from Graded Connections Our notion of a connection will
depend on a threshold, $\alpha$, set according to the needs of the context of reasoning.
Given $0 < \alpha < 1$, we define a functor $C_\alpha$ as follows.

Definition 4.4 $X\varepsilon C_\alpha(Y) \Leftrightarrow \exists r,s \ge \alpha.X\varepsilon C(r,s)(Y)$.

Then the functor $C_\alpha$ has all the properties of a connection:

Proposition 4. (IP) For any $\alpha$: (i) $X\varepsilon C_\alpha(X)$; (ii) $X\varepsilon C_\alpha(Y) \Rightarrow Y\varepsilon C_\alpha(X)$;
(iii) $X \ne Y \Rightarrow \exists Z.(Z\varepsilon C_\alpha(X) \wedge \neg(Z\varepsilon C_\alpha(Y)) \vee Z\varepsilon C_\alpha(Y) \wedge \neg(Z\varepsilon C_\alpha(X)))$.

See Section 2.3 for basic notions related to mereological theories based on the notion
of a connection.

Thus, X and Y are connected in the limit sense whenever they cannot be separated
by means of their open neighborhoods.

Cf. an analogous assumption in mereology based on connection [11].
5 Examples

In this section, we will give some examples related to notions presented in the
preceding sections.



5.1 Concerning case of points

Our universe will be selected from a quad-tree in the Euclidean plane formed
by squares $[k + \frac{i}{2^s}, k + \frac{i+1}{2^s}] \times [l + \frac{j}{2^s}, l + \frac{j+1}{2^s}]$ where $k, l \in Z$, $i, j = 0, 1, \ldots, 2^s - 1$
and $s = 1, 2, \ldots$

The choice of points will depend on the level of granularity of knowledge we
assume. We assume that our sensor system perceives each square X as the
square $X^\alpha$ whose side length is that of X plus $2\alpha$, where $\alpha = 2^{-s}$ for some
$s > 1$ (we then express uncertainty of location applying “hazing” of objects, cf.
[18]). We restrict ourselves to squares with side length at least $4\alpha$ (as smaller
squares may be localized with too high an uncertainty). We let for simplicity
$4\alpha = 1$, so points are squares of the form $[k, k+1] \times [l, l+1]$, $k, l \in Z$. We define
$\mu_r$ by letting

$X\varepsilon\mu_r(Y) \Leftrightarrow \frac{\lambda(X^\alpha \cap Y^\alpha)}{\lambda(X^\alpha)} \ge r$

where $X^\alpha, Y^\alpha$ are the enlargements of X, Y defined above and $\lambda$ is the area
(Lebesgue) measure in the two-dimensional plane. We may check straightforwardly
that

Proposition 5. Functors $\mu_r$ satisfy (RM0)-(RM4).

Applying our notion of the connection $C_\alpha$ defined in Section 4, we find that
two adjacent squares (e.g. $X = [0,1] \times [0,1]$, $Y = [0,1] \times [1,2]$) are connected in
degree 0.3(3) (i.e. $X\varepsilon C_{0.3(3)}(Y)$) while two squares having one vertex in common
(e.g. $X = [0,1] \times [0,1]$ and $Y = [1,2] \times [1,2]$) are connected in degree 0.1(1)
(i.e. $X\varepsilon C_{0.1(1)}(Y)$). Pairs of disjoint squares are connected in degree 0 at most.
Observe that in this case $C_\alpha$ with $\alpha > 0$ is a connection even if (IP) does not
hold.
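The degrees quoted above can be reproduced with a few lines of exact arithmetic. The sketch below assumes that hazing enlarges a square by $\alpha$ on every side (so each side grows by $2\alpha$, with $\alpha = 1/4$), and that the degree is the area of the overlap of the two enlarged squares divided by the area of the first enlarged square. The function names and the tuple representation of squares are hypothetical.

```python
from fractions import Fraction

ALPHA = Fraction(1, 4)  # hazing parameter, from the convention 4 * alpha = 1


def enlarge(square, alpha=ALPHA):
    """Grow an axis-parallel square ((x0, x1), (y0, y1)) by alpha on each side."""
    (x0, x1), (y0, y1) = square
    return ((x0 - alpha, x1 + alpha), (y0 - alpha, y1 + alpha))


def intersect(a, b):
    """Coordinate-wise intersection; may be an empty (inverted) box."""
    (ax0, ax1), (ay0, ay1) = a
    (bx0, bx1), (by0, by1) = b
    return ((max(ax0, bx0), min(ax1, bx1)), (max(ay0, by0), min(ay1, by1)))


def area(box):
    """Area of a box, treating inverted intervals as empty."""
    (x0, x1), (y0, y1) = box
    return max(x1 - x0, 0) * max(y1 - y0, 0)


def degree(x, y, alpha=ALPHA):
    """Overlap of the enlarged squares relative to the enlarged x."""
    xa, ya = enlarge(x, alpha), enlarge(y, alpha)
    return Fraction(area(intersect(xa, ya))) / Fraction(area(xa))


unit = ((0, 1), (0, 1))
adjacent = degree(unit, ((0, 1), (1, 2)))   # squares sharing an edge
diagonal = degree(unit, ((1, 2), (1, 2)))   # squares sharing one vertex
far = degree(unit, ((3, 4), (0, 1)))        # disjoint squares
```

Under these assumptions the edge-sharing pair gives 1/3, the vertex-sharing pair gives 1/9, and the disjoint pair gives 0, which agrees with the values 0.3(3), 0.1(1) and 0 stated in the text.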



5.2 Connections in Distributed Systems 

We refer to a model for approximate synthesis in a distributed system MA
proposed in [13]. Reasoning in MA goes by means of standards and rough inclusions

Z denotes the set of integers.

A point is an object X with the property that $Y\varepsilon el(X) \Rightarrow Y = X$.

We recall briefly its main ingredients. Consider a distributed (multi-agent) system
MA = {Ag, Link, Inv} where Ag is a set of agents, Link is a finite list of words over Ag
and Inv is a set of inventory objects. Each t in Link is a word $ag_1ag_2 \ldots ag_kag$ meaning
that ag is the parent node and $ag_1, ag_2, \ldots, ag_k$ are children nodes in an elementary team
t; both parties are related by means of the operation $o_t$ which makes from a tuple
$\mu_{ag}$ at any ag. Instrumental in the reasoning process are rough connectives
where t G Link and a =(sti, st) is admissible. They propagate rough in- 
clusion values from children nodes to the parent node according to the formula 
Vi (st) . 

We assume that rough connectives / are cofinal i.e. for any r < 1 there are 
< 1 with > r (this assumption clearly implies some form 

of potential infinity in data tables). Then Ct is preserved in the sense 

Proposition 6. \ii.XieCT^T{sti) ot{xi^ ..^Xk)£CT^T{st), 

Similarly, one has a minimax formula for 

Proposition 7. •••, Z/fc)) 

where F(ai,..,afc) = in/^_i{/(ri, .., r^), /(si, ..., s^) : > 



Case of Ct Consider a system Ma where Ag={agQ} U 
consists of words of the form agLagf‘^...agl^agf with ii <^2 ^ •••• ^ i-6. 

any agent agf of the level i may form a local team with agents of lower levels 
ii, ^ 2 , ••••, For any agent ag = agf of the level the universe Uag consists of 
points, i.e. squares of side length $2^{-i}$, and of classes of squares sent by children
agents in local teams with the head ag. Thus, any agent has a potentially infinite
universe. In particular, the agent $ag_0$ has in its universe all points of side length 1
as well as all classes (unions) of finite collections of squares of smaller size sent by
lower level agents. In this context, one may define a connection $C_T$. We define the
rough inclusion $\mu^{ag}$ at any agent ag by the formula: $X\varepsilon\mu^{ag}_r(Y) \Leftrightarrow \frac{\lambda(X \cap Y)}{\lambda(X)} \ge r$;
thus el is $\subseteq$. Then one checks in a direct way that two squares whose union
is connected topologically (e.g. $X = [0,1] \times [0,1]$, $Y = [1,2] \times [0,1]$) satisfy
the formula $X\varepsilon C_T(Y)$ while for X, Y with $Kl(X,Y) = X \cup Y$ not connected
topologically we have $\neg(X\varepsilon C_T(Y))$.

Acknowledgement This work has been prepared under the grant no 8T11C 
02417 from the State Committee for Scientific Research (KBN) of the Republic 
of Poland. 

$(x_1, \ldots, x_k)$ of objects, resp. at $ag_1, \ldots, ag_k$, the object $o_t(x_1, \ldots, x_k)$ at ag; the tuple
$(x_1, \ldots, x_k, o_t(x_1, \ldots, x_k))$ is admissible. Leaf agents Leaf are those ag which are not
any parent node. They operate on objects from Inv. Each agent ag is equipped with
an information system $A_{ag} = (U_{ag}, A_{ag})$ and a rough inclusion $\mu_{ag}$ on $U_{ag}$; a set
$St_{ag} \subseteq U_{ag}$ of standard objects is also defined for any ag.

These results form a basis for distributed connection calculi to be explored in future
research.

I.e. we apply the formula from Section 5.1 but without hazing.

Thus, in distributed environments it is possible to define naturally an external
connection EC (i.e. some objects may connect to each other without overlapping) even
with objects simple from the topological point of view.

Observe that (IP) holds in this case.
Abstract. This paper shows that the problem of decomposing a finite
function f(A,B) into the form h(g(A), B), where g is a Boolean function,
can be resolved in polynomial time with respect to the size of the
problem. It is also shown that omission of the characteristic of the g
function can significantly complicate the problem. Such a general problem
belongs to the NP-hard class of problems. The work shows how the
problem of decomposition of a finite function can be reduced to the problem
of coloring the vertices of a graph. It is also shown that the problem
of decomposition of relations can be reduced to coloring the vertices of
their hypergraphs. In order to prove the validity of the theorems,
Helly's combinatorial properties are used.



1 Introduction 



Decomposition methods of finite functions have wide application in many areas
of computer science. One such area is artificial intelligence, in particular
machine learning, logic synthesis and image processing. In machine learning and
image processing, decomposition methods can be used in the process of discovering
some hidden properties of the data [13], [7]. These methods can also be used
in the process of compressing decision rules [10], [8]. In logic synthesis,
decomposition methods are very effective in FPGA/PLA based digital circuit design
[5], [10], [6]. Considering the wide variety of applications of functional
decomposition, it becomes necessary to consider the computational complexity of
certain decomposition problems and the effectiveness of the algorithms used.
Apart from these analytical issues, there are other questions, such as: how general
are the existing algorithms, and can they be further generalized? This work
deals with decomposition of relations, which is a generalization of the problem of
functional decomposition. This problem was briefly discussed in the paper [6].
In order to justify the answers to the questions posed above, we have to use
combinatorics and Helly's properties.

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 487-494, 2001. 
2 Basic Principles

Graph G = (V, E) is a pair of sets V and E, where V is the given set, and
E is the binary relation defined on V. Members of set V are called vertices
and members of set E are called edges. A stable set, or independent set, is a
subset of vertices with the property that no two vertices in the stable set are
adjacent. A complete and correct coloring of the vertices of a graph G is understood
as a completely defined function $c : V \rightarrow Colors$ such that, if $\{v,w\} \in E$, then
$c(v) \ne c(w)$. A graph is said to be k-colorable if the set of colors has cardinality k.
The least natural number k for which the graph is k-colorable is called the chromatic
number of graph G and is denoted as $\chi(G) = k$. So, any independent set of
vertices can be monochromatic. The problem of computing $\chi(G)$ for any given
graph G is NP-hard [4].
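To make the definitions of a correct coloring and of the chromatic number concrete, here is a brute-force sketch. It is exponential in the number of vertices and is meant only for tiny graphs; it is not an algorithm taken from the paper. The function names are hypothetical, and the graph is assumed to have no loops.

```python
from itertools import product


def is_proper_coloring(edges, coloring):
    """True when the two endpoints of every edge receive different colors."""
    return all(coloring[v] != coloring[w] for v, w in edges)


def chromatic_number(vertices, edges):
    """Smallest k admitting a proper coloring, found by exhaustive search."""
    vertices = list(vertices)
    if not vertices:
        return 0
    for k in range(1, len(vertices) + 1):
        # Try every assignment of k colors to the vertices.
        for colors in product(range(k), repeat=len(vertices)):
            if is_proper_coloring(edges, dict(zip(vertices, colors))):
                return k
    raise ValueError("graph contains a loop, so no proper coloring exists")


# A cycle on five vertices needs three colors; a path needs only two.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
path3 = [(0, 1), (1, 2)]
```

The exhaustive search mirrors the definition directly; the NP-hardness noted in the text is the reason no substantially faster general method is expected.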

Hypergraph $H = (X, \mathcal{E})$ is an extension of the idea of a graph. Edges of a
hypergraph H can be connected to any number of vertices of X, that is, $\mathcal{E} \subseteq P(X)$.
A subset of vertices T of a hypergraph H which is connected to all the edges of
the hypergraph is called a transversal set of vertices. The strength of the smallest
transversal (with the least number of members) of a hypergraph H is denoted
by $\tau(H)$. The problem of computing $\tau(H)$ for a given hypergraph H is NP-hard
[4]. Moreover, by an independent set of vertices of hypergraph H we mean a set
of vertices no two of which are connected by an edge of H, and by a coloring
of a hypergraph we mean any function that is not monochromatic on the vertices
of any edge.

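Let $Card(X)$ denote the cardinality of a set $X$. A hypergraph
$H = (X, \mathcal{F})$ with $\mathcal{F} = \{E_1, \dots, E_n\}$ is said to have Helly's property [1]
if every subset of $\mathcal{F}$ whose members intersect pairwise is a star:

\[ \forall J \subseteq \{1, 2, \dots, Card(\mathcal{F})\}: \; \bigl(\forall i, j \in J: E_i \cap E_j \neq \emptyset\bigr) \Rightarrow \bigcap_{k \in J} E_k \neq \emptyset \]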

Let $H = (X, \mathcal{F})$ be a hypergraph. A graph $G_H = (V, E)$ is called the representative
graph of hypergraph $H$ iff $V = \mathcal{F}$ and $(f_1, f_2) \in E \Leftrightarrow f_1 \cap f_2 \neq \emptyset$.

Theorem 1. Let $H$ be a hypergraph with Helly's property. If $co\text{-}G = (V, V^2 \setminus E)$
denotes the co-graph of any graph $G = (V, E)$, then the sets that generate an
independent set of vertices in $co\text{-}G_H$ form a star.

Proof. Refer to [1].

Theorem 2. Let $H$ be a hypergraph with Helly's property. Then the
following equation holds:

\[ \tau(H) = \chi(co\text{-}G_H). \]

Proof. Refer to [9].

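Helly's property generalizes to the k-Helly property. A hypergraph $H = (X, \mathcal{F})$
is said to have the k-Helly property if every subset of $\mathcal{F}$ is a k-star:

\[ \forall J \subseteq \{1, 2, \dots, Card(\mathcal{F})\}: \; \Bigl(\forall I \subseteq J: Card(I) \le k \Rightarrow \bigcap_{i \in I} E_i \neq \emptyset\Bigr) \Rightarrow \bigcap_{j \in J} E_j \neq \emptyset \]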
Therefore, Helly's property is the same as the 2-Helly property. The following fact
can be derived straight from the definitions.

Theorem 3. If a hypergraph $H = (X, \mathcal{F})$ is k-Helly and $k < h$, then the
hypergraph is h-Helly as well.

The k-Helly property was stated for the first time in combinatorial geometry and
is closely related to the classical theorem named after Helly [1].

Theorem 4. (Helly) Finite families of convex sets in $\mathbb{R}^{n-1}$ have the n-Helly property.

This leads to a simple combinatorial corollary:

Theorem 5. If a given hypergraph $H = (X, \mathcal{F})$ satisfies $X = [n]$, then it is
n-Helly.

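Let $H = (X, \mathcal{F})$ be a hypergraph. A hypergraph $H_k = (V, E)$ is called the
k-representative hypergraph of hypergraph $H$ if $V = \mathcal{F}$ and

\[ I \in E \Leftrightarrow \Bigl( I \subseteq \{1, \dots, Card(\mathcal{F})\} \;\&\; Card(I) \le k \;\&\; \bigcap_{i \in I} E_i \neq \emptyset \Bigr) \]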

Therefore, $H_2 = G_H$. From Theorems 1 and 2 we can derive the following facts:

Theorem 6. Let hypergraph $H$ be k-Helly. The sets generating an independent
set of vertices in the co-representative hypergraph $co\text{-}H_k$ form a k-star.

Theorem 7. Let hypergraph $H$ be k-Helly. Then the following equation is true:

\[ \tau(H) = \chi(co\text{-}H_k). \]

3 Decomposition of Finite Functions 

Basic concepts of the decomposition of partially defined finite functions are
presented in this section. Let $F : [m]^n \to [m]^k$ ($F = \{f_i\}_{i \in [k]}$) be a set of finite
functions, where $f_i : [m]^n \to [m]$. Let A (the bound set) and B (the free set) be a
partition of the set of variables Var(F) into two disjoint subsets. The decomposition
chart of the function F is a two-dimensional matrix whose columns are indexed by the
values of the bound set variables and whose rows are indexed by the values of the free
set variables. The element $m_{ij}$ of the matrix is the value assumed by the
function F for the vector constructed from the i-th row and the j-th column. The column
multiplicity of the matrix is denoted by $\nu(B|A)$.

Theorem 8. Let $F : [m]^n \to [m]^k$, $F = \{f_i\}_{i \in [k]}$, be a group of partial finite
functions, where $f_i : [m]^n \to [m]$. Let A and B be a pair of disjoint subsets of the
variables Var(F). Then

\[ F(A, B) = H(G(A), B) \Leftrightarrow \nu(B|A) \le d^j, \]

where $G = \{g_k(A)\}_{k=1,\dots,j}$, $H = \{h_i\}_{i \in [k]}$ and $u = \max\{d, m\}$.
Proof. Refer to [8].

This theorem suggests that the fundamental problem of decomposing a partial
finite function is nothing but finding an expansion of the function for which
$\nu(B|A)$ has the least value.

Problem: k-decomposition of partial finite functions
for a given pair of disjoint sets of input variables

Given: a finite group of finite functions F,
the partition of the set of input variables into a pair of subsets A and B, and $k \in N$.

Problem: Is there an expansion of the function F for which $\nu(B|A) = k$?



The problem of finding an expansion of the function for which $\nu(B|A)$ has
the least value is nothing but a graph coloring problem [12].

Theorem 9. The problem of k-decomposition of partially defined finite functions
reduces (in PTIME) to the problem of k-coloring a graph.

Proof outline. It can be observed that the columns of the decomposition matrix
can be treated as intervals in the lattice $K = ([m]^n, \le)$, where n and m are
natural numbers and $\le$ is the lexicographical order. Our aim is to verify whether we
can "paste" the columns so that only k columns remain. This problem
is equivalent to asking whether the hypergraph for the given section has a
k-element transversal. By Theorem 2 (an interval in the lattice has the 2-Helly
property), the question is in turn equivalent to asking whether the
co-representative graph of the hypergraph (denoted by $co\text{-}G_H$) is k-colorable.
So we need only construct the representative graph, which amounts to constructing
the incidence graph. The time needed is of the order $O(c^2 r)$, where c and r are
the numbers of columns and rows of the decomposition matrix.

On the basis of the above explanation, the following heuristic is proposed for 
decomposing finite functions. 

Decomposition heuristic

Input: a finite group of finite functions F and
a partition of the input variables into a pair of subsets A and B;

Output: a decomposition F(A, B) = H(G(A), B), minimizing the value $\nu(B|A)$;

1. find the graph of incompatible columns in the decomposition matrix;

2. find the minimum coloring for the graph G;

3. find the decomposition H(G(A), B) for the function F from the results obtained in step 2.
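The three steps of the heuristic can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the paper: columns are lists of symbols, the marker "-" for an undefined entry and all function names are hypothetical, and step 2 uses a simple sequential coloring, which only approximates the minimum coloring.

```python
from itertools import combinations

DONT_CARE = "-"  # assumed marker for an undefined entry of a partial function


def compatible(col_a, col_b):
    """Two columns are compatible if they never disagree on a defined entry."""
    return all(x == y or DONT_CARE in (x, y) for x, y in zip(col_a, col_b))


def incompatibility_graph(columns):
    """Step 1: vertices are column indices, edges join incompatible columns."""
    adj = {i: set() for i in range(len(columns))}
    for i, j in combinations(range(len(columns)), 2):
        if not compatible(columns[i], columns[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def sequential_coloring(adj):
    """Step 2 (approximate): smallest colour not used by already coloured neighbours."""
    color = {}
    for v in adj:
        used = {color[w] for w in adj[v] if w in color}
        color[v] = next(c for c in range(len(adj)) if c not in used)
    return color


def merge_columns(columns, color):
    """Step 3: columns sharing a colour are pasted into one column of H."""
    merged = {}
    for idx, c in color.items():
        target = merged.setdefault(c, [DONT_CARE] * len(columns[idx]))
        for row, value in enumerate(columns[idx]):
            if value != DONT_CARE:
                target[row] = value
    return merged


# Hypothetical decomposition chart with four columns and three rows.
columns = [list("01-"), list("0-1"), list("10-"), list("-01")]
color = sequential_coloring(incompatibility_graph(columns))
merged = merge_columns(columns, color)
```

The number of distinct colours is the column multiplicity reached by this particular coloring; the colour of a column plays the role of the value of G(A) for the corresponding bound-set vector.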



Theorem 10. The problem of k-decomposition for k = 1, 2 for partially defined
Boolean functions (and also for partial finite functions) is in PTIME.

Proof. (Sketch) The theorem is derived from Theorem 9 and from the fact
that the question of whether a given graph is two-colorable can be resolved in
polynomial time.
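The polynomial-time test for two-colorability used in this proof sketch can be illustrated with a breadth-first search. The sketch below is a standard bipartiteness check, given here as an assumption about how the test might be carried out; the adjacency-dictionary representation and the function name are illustrative only.

```python
from collections import deque


def two_coloring(adj):
    """Return a proper 2-colouring as a dict, or None if the graph is not bipartite."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in color:
                    color[w] = 1 - color[v]  # neighbours get the opposite colour
                    queue.append(w)
                elif color[w] == color[v]:
                    return None  # an odd cycle was found
    return color


path = {0: {1}, 1: {0, 2}, 2: {1}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
```

Each vertex and edge is visited a bounded number of times, so the running time is linear in the size of the graph.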
Theorem 11. The problem of k-decomposition of partial finite functions for
$k \ge 3$ is NP-complete.

Proof. (Sketch) The proof is based on two observations. First, the problem of
k-decomposition of partial finite functions belongs to the class NP. Secondly, the
problem of k-coloring of graphs is reduced (in PTIME) to the problem of
k-decomposition of partial finite functions. The reduction procedure is given in
Algorithm 1. The procedure generates the decomposition table for the given
graph G, so that $\chi(G) = \nu(B|A)$.

Algorithm 1. The symbol "-" denotes a don't-care value.

procedure Reduction;
begin
  for i := 1 to n do
  begin
    for k := 1 to (i - 1) do v[k, i] := "-";
    v[i, i] := "1";
    for k := (i + 1) to n do
      if {v_i, v_k} in E then v[k, i] := "0"
      else v[k, i] := "-";
  end;
end;
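The reduction can be checked on a small graph with the following Python sketch. It is an illustrative transcription of Algorithm 1 under the assumption that vertices are numbered from 0 and that the table is stored as a list of rows; the helper names are hypothetical. Two columns clash exactly when the corresponding vertices are adjacent, which is what makes the column incompatibility graph equal to G.

```python
DONT_CARE = "-"


def reduction_table(n, edges):
    """Build the n x n table v[k][i] (row k, column i) of Algorithm 1, 0-indexed."""
    edge_set = {frozenset(e) for e in edges}
    v = [[DONT_CARE] * n for _ in range(n)]
    for i in range(n):
        v[i][i] = "1"  # diagonal entry of column i
        for k in range(i + 1, n):
            if frozenset((i, k)) in edge_set:
                v[k][i] = "0"  # below the diagonal: 0 marks an edge {i, k}
    return v


def columns_incompatible(v, a, b):
    """Columns a and b clash if some row holds two different defined values."""
    return any(
        row[a] != row[b] and DONT_CARE not in (row[a], row[b]) for row in v
    )


# Hypothetical example graph on four vertices.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
table = reduction_table(4, edges)
```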



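Example 1. This example demonstrates the reduction algorithm for the graph G
in the figure. The algorithm results in:

Table 1. Decomposition table produced by Algorithm 1 for the graph G (row and column layout not recoverable; entries in reading order: 1 - 0 0 - 1 - - 0 - - 1 0 - - - - 1 0).

Indeed, $\chi(G) = \nu(B|A) = 3$.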



4 Decomposition of Finite Relations 

This section generalizes the problem of decomposition of partial finite functions
into the problem of decomposition of finite relations. A finite relation is a function
expressed as $r : X^n \to P(X)$. Although the images of a finite relation are subsets of a
certain set X, they need not possess Helly's property (e.g. {1, 2}, {1, 3}, {2, 3}).
This is unlike the images of a group of partial functions, which were intervals
in the lattice $K = ([m]^n, \le)$. Therefore, we have to apply more subtle tools to
decompose relations. These factors were not considered in the earlier work [6].
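The family {1, 2}, {1, 3}, {2, 3} mentioned above can be checked against the definition of the k-Helly property with a brute-force test. The following Python sketch is only an illustration of the definition (exponential in the size of the family, with hypothetical function names); it is not part of the algorithms of the paper.

```python
from itertools import combinations


def common(sets):
    """Intersection of a non-empty collection of sets."""
    return set.intersection(*list(sets))


def is_k_helly(family, k):
    """Brute-force check of the k-Helly property for a finite family of sets."""
    family = [set(e) for e in family]
    for size in range(1, len(family) + 1):
        for sub in combinations(family, size):
            # every subfamily of at most k members has a common element
            small_ok = all(
                common(part)
                for r in range(1, min(k, size) + 1)
                for part in combinations(sub, r)
            )
            if small_ok and not common(sub):
                return False
    return True


triple = [{1, 2}, {1, 3}, {2, 3}]
intervals = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
```

The three pairwise intersecting sets have no common element, so the family is not 2-Helly, while it is 3-Helly in agreement with Theorem 5.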

Retaining the definitions given in section three and changing the discussion 
from functions to relations, we can state the problem of decomposing a relation 
as follows: 
Problem: k-decomposition of a finite relation
for a given pair of disjoint sets of input variables

Given: a finite relation r, a partition of the input variables into a pair of subsets A and B, and $k \in N$.

Problem: Is there an expansion of the relation r for which $\nu(B|A) = k$?



Theorem 12. Let $r : X^n \to P(X)$ be a finite relation. The problem of decomposing
relation r can be reduced to the problem of coloring a hypergraph.

Proof. By Theorem 7, on the basis of our understanding of the decomposition
table for relation r, we have to construct the k-representative hypergraph for k =
Card(X). Algorithm 2 reduces the problem of decomposing a relation into the
problem of coloring the hypergraph. If the decomposition table consists of r-rows 
and c-columns, the complexity of the algorithm is approximated to 0(cr^+^). In 
the algorithm, the symbol $X|_A$ denotes the projection of set X on the members of
set A.

Algorithm 2. 

procedure ConstructH_k;
begin
  k := Card(X);
  for i := 2 to k do
  begin
    V := successive i-element subsets of the set in lexicographical order;
    if (V is a solitary set of the hypergraph H (the intersection of its members is empty))
    then F := F ∪ {V};
  end;
end;



5 Computer Test Results 

The presented algorithms were implemented as a package of computer programs 
in the Institute of Telecommunication at Warsaw University of Technology. This 
section presents the results obtained. 

Table 2 presents the results for logic synthesis problems of decomposing
digital circuits to be implemented using PLAs. Silicon area is measured using
the formula $S = P(4 n_4 + 2 n_2 + m)$, where P is the number of terms, $n_4$ the
number of 4-valued inputs, $n_2$ the number of 2-valued inputs, and m the number of
outputs. The character 'b' means decomposition in the binary system and 'm' means
decomposition in the multiple-valued system.

Table 3 presents results for the decomposition of information systems specified
using "if_then_else_" type rules. Here, silicon area is measured using the
formula $A = p \sum_i q_i$, where p is the number of rules and $q_i$ is the number of bits needed
to represent the i-th attribute.






Decomposition of Boolean Relations and Functions in Logic Synthesis and Data Analysis 






Table 2.

Function   Original       Silicon area after   Silicon area after
           silicon area   decomposition (b)    decomposition (m)
Rd84        5120           1280                 1408
Rd73        2178            776                 1142
Alu2       26624          24399                25124
Misex1      5888           1984                 1984
Sao        24576           8192                 8192

Table 3.

Example    Silicon area before   Silicon area after
           decomposition         decomposition
Monks1te   4752                   487
Monks2te   4752                   507
Monks1tr   1364                   933
Monks2tr   1859                  1095
Monks3tr   1342                   929
House      3944                  1049
Nurse      1365                   475

Additionally, Table 4 contains the computational results of decomposition
algorithms based on different approximate coloring heuristics (namely: the simple
sequential algorithm, the largest-first sequential algorithm, the smallest-last
sequential algorithm, and the maximum independent set algorithm) applied to digital
circuits to be implemented using PLA matrices.



Table 4.

Example   Original size   After SSA   After LFSA   After SLSA   After MISA
Bbtas      210             157         171          162          162
Ex7        306             288         288          288          208
9sym      1615             530         541          515          543
Trail      340             335         335          335          335
opus       588             544         544          549          576
Sqrt8      760             386         386          386          386
Adr4      1575             593         574          593          620
clip      2691            1414        1393         1337         1421
Z4        1062             377         366          366          424
Sao2      1392             810         820          792          774
root      1197             879         899          873          864

It is evident that the application of decomposition methods has resulted in a
significant reduction of silicon area. The results also indicate that the generalized
decomposition methods for multiple-valued functions compare well with the
classical binary decomposition methods.
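The size of the reduction can be read off Table 2 directly. The short Python sketch below computes the relative saving for each benchmark from the values in the table; only the helper name and the output layout are assumptions.

```python
# Silicon areas from Table 2: (original, after binary, after multiple-valued)
TABLE_2 = {
    "Rd84": (5120, 1280, 1408),
    "Rd73": (2178, 776, 1142),
    "Alu2": (26624, 24399, 25124),
    "Misex1": (5888, 1984, 1984),
    "Sao": (24576, 8192, 8192),
}


def reduction_percent(before, after):
    """Relative area saving in percent."""
    return 100.0 * (before - after) / before


for name, (original, binary, multi) in TABLE_2.items():
    print(f"{name:7s} b: {reduction_percent(original, binary):5.1f}%  "
          f"m: {reduction_percent(original, multi):5.1f}%")
```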



6 Conclusion 

This work has shown that the problem of decomposing a finite function f(A, B)
into h(g(A), B), where g is a Boolean function, can be solved in polynomial
time with respect to the size of the problem. It has been shown that dropping
this restriction on the function g can significantly complicate the problem:
such generalized problems belong to the class of NP-hard problems. The work
presented the reduction of the problem of decomposition of a finite function to
the problem of coloring the vertices of a graph. It was also shown that the
problem of decomposition of a relation can be reduced to coloring the vertices of
a hypergraph. The presented algorithms were implemented in a package of computer
programs at the Warsaw University of Technology. Two important conclusions
were drawn from the experiments conducted with this software package. First of
all, the experiments show that the application of decomposition methods leads to
a reduction in silicon area for examples taken from logic synthesis and also for the
multiple-valued examples taken from information systems. Secondly, the use of the
generalized methods, that is, methods to decompose multiple-valued functions, instead
of decomposition methods for Boolean functions did not have a negative influence
on the effective performance of the presented system. In effect, more tolerant
software has been developed that can analyze a wider spectrum of problems.
Abstract. In existing studies, diagnostic reasoning has been modeled
as if-then rules. However, closer examination suggests
that medical diagnostic reasoning should consist of multiple strategies, in
which one of the most important characteristics is that domain experts
change the granularity of rules in a flexible way. First, medical experts
use the coarsest information granules (as rules) to select the foci. For
example, if the headache of a patient comes from vascular pain, we do
not have to examine the possibility of muscle pain. Next, medical
experts switch to finer granules to select the candidates. After several
steps, they reach the final diagnosis by using the finest granules for this
diagnostic reasoning. In this way, the coarseness or fineness of
information granules plays a crucial role in the reasoning steps. In this paper, we
focus on the characteristics of this medical reasoning from the viewpoint
of granular computing and formulate the strategy of switching the
information granules. Furthermore, using the proposed model, we introduce
an algorithm which induces if-then rules with a given level of granularity.



1 Introduction 

One of the most important problems in developing expert systems is knowledge
acquisition from experts [2]. In order to automate this process, many inductive
learning methods, such as induction of decision trees [1, 10], induction of decision
lists [3], rule induction methods [4-7, 10, 11] and rough set theory [8, 13, 17, 18], have
been introduced and applied to extract knowledge from databases, and the results
show that these methods are appropriate.

However, it has been pointed out that conventional rule induction methods
cannot extract rules which plausibly represent experts' decision processes [13,
14]: the description length of induced rules is too short compared with the
experts' rules (those results are shown in Appendix B). For example, rule
induction methods, including AQ15 [7] and PRIMEROSE [13], induce the following
common rule for muscle contraction headache from databases on differential
diagnosis of headache [14]:

[location=whole] & [Jolt Headache=no] & [Tenderness of M1=yes]
=> muscle contraction headache.

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 495-502, 2001. 
This rule is shorter than the following rule given by medical experts.

[Jolt Headache=no]
& [Tenderness of M1=yes] & [Tenderness of B1=no]
& [Tenderness of C1=no]
=> muscle contraction headache,

where [Tenderness of B1=no] and [Tenderness of C1=no] are added.

These results suggest that conventional rule induction methods do not reflect 
a mechanism of knowledge acquisition of medical experts. 

In this paper, we focus on the characteristics of this medical reasoning from 
the viewpoint of granular computing and formulate the strategy of switching the 
information granules. Furthermore, using the proposed model, we introduce an 
algorithm which induces if-then rules with a given level of granularity. 

The paper is organized as follows: Section 2 discusses the characteristics 
of medical reasoning. Section 3 shows formalization of this diagnostic reasoning 
from the viewpoint of information granulation. Section 4 presents a formal model 
of medical differential diagnosis and rule induction algorithm for this model. 
Finally, Section 5 concludes this paper. 

2 Medical Reasoning 

As shown in Section 1, rules acquired from medical experts are much longer
than those induced from databases the decision attributes of which are given
by the same experts. This is because rule induction methods generally search
for shorter rules, compared with decision tree induction. In the latter case, the
induced trees are sometimes too deep, and in order for the trees to be informative,
pruning and examination by experts are required. One of the main reasons why
rules are short and decision trees are sometimes long is that these patterns are
generated by only one criterion, such as high accuracy or high information gain.
The comparative study in this section suggests that experts acquire rules
not by a single criterion but by using several measures.

Those characteristics of medical experts' rules are fully examined not by
comparing those rules for the same class, but by comparing experts'
rules with those for another class. For example, a classification rule for muscle
contraction headache is given by:^1

[Jolt Headache=no]
& ( [Tenderness of M0=yes] or [Tenderness of M1=yes]
or [Tenderness of M2=yes] )
& [Tenderness of B1=no] & [Tenderness of B2=no]
& [Tenderness of B3=no] & [Tenderness of C1=no]
& [Tenderness of C2=no] & [Tenderness of C3=no]
& [Tenderness of C4=no]
=> muscle contraction headache

^1 Readers may say that these two rules are too long. However, these attribute-
value pairs are required for accurate diagnosis although they are redundant and some
of them are related to others. This redundancy is one of the characteristics of medical
reasoning.

This rule is very similar to the following classification rule for disease of cervical
spine:

[Jolt Headache=no]
& ( [Tenderness of M0=yes] or [Tenderness of M1=yes]
or [Tenderness of M2=yes] )
& ( [Tenderness of B1=yes] or [Tenderness of B2=yes]
or [Tenderness of B3=yes] or [Tenderness of C1=yes]
or [Tenderness of C2=yes] or [Tenderness of C3=yes]
or [Tenderness of C4=yes] )
=> disease of cervical spine

The differences between these two rules are the attribute-value pairs from
tenderness of B1 to C4. Thus, these two rules can be simplified into the following
form:

\[ a_1 \wedge A_2 \wedge \neg A_3 \to \text{muscle contraction headache} \]
\[ a_1 \wedge A_2 \wedge A_3 \to \text{disease of cervical spine} \]

The first two terms and the third one represent different reasoning. The first
and second terms, $a_1$ and $A_2$, are used to differentiate muscle contraction headache
and disease of cervical spine from other diseases. The third term, $A_3$, is used to
make a differential diagnosis between these two diseases. Thus, medical experts
first select several diagnostic candidates, which are very similar to each other,
from many diseases and then make a final diagnosis from those candidates.

3 Formalization of Medical Reasoning 

3.1 Accuracy and Coverage 

In the subsequent sections, we adopt the following notation, which is introduced
in [12].

Let U denote a nonempty, finite set called the universe and A denote a
nonempty, finite set of attributes, i.e., $a : U \to V_a$ for $a \in A$, where $V_a$ is called
the domain of a. Then, a decision table is defined as an information
system, $\mathcal{A} = (U, A \cup \{d\})$.

The atomic formulas over $B \subseteq A \cup \{d\}$ and V are expressions of the form
[a = v], called descriptors over B, where $a \in B$ and $v \in V_a$. The set F(B, V) of
formulas over B is the least set containing all atomic formulas over B and closed
with respect to disjunction, conjunction and negation.

For each $f \in F(B, V)$, $f_A$ denotes the meaning of f in $\mathcal{A}$, i.e., the set of all
objects in U with property f, defined inductively as follows.

1. If f is of the form [a = v] then $f_A = \{s \in U \mid a(s) = v\}$.
2. $(f \wedge g)_A = f_A \cap g_A$; $(f \vee g)_A = f_A \cup g_A$; $(\neg f)_A = U - f_A$.

By the use of this framework, classification accuracy and coverage, or true
positive rate, are defined as follows.

Definition 1 (Accuracy and Coverage).

Let R and D denote a formula in F(B, V) and a set of objects which belong to
a decision d. Classification accuracy and coverage (true positive rate) for $R \to d$
are defined as:

\[ \alpha_R(D) = \frac{|R_A \cap D|}{|R_A|} \; (= P(D|R)), \quad \text{and} \]

\[ \kappa_R(D) = \frac{|R_A \cap D|}{|D|} \; (= P(R|D)), \]

where |A| denotes the cardinality of a set A, $\alpha_R(D)$ denotes the classification
accuracy of R as to classification of D, and $\kappa_R(D)$ denotes the coverage, or true
positive rate, of R to D, respectively.

It is notable that these two measures are equal to conditional probabilities:
accuracy is the probability of D under the condition of R, and coverage is that of R
under the condition of D. It is also notable that $\alpha_R(D)$ measures the degree of
the sufficiency of a proposition $R \to D$, and that $\kappa_R(D)$ measures the degree of
its necessity.^2

For example, if $\alpha_R(D)$ is equal to 1.0, then $R \to D$ is true. On the other
hand, if $\kappa_R(D)$ is equal to 1.0, then $D \to R$ is true. Thus, if both measures are
1.0, then $R \leftrightarrow D$.

Also, Pawlak recently reported a Bayesian relation between accuracy and
coverage [9]:

\[ \kappa_R(D) P(D) = P(R|D) P(D) = P(R, D) = P(R) P(D|R) = \alpha_R(D) P(R) \]

This relation also suggests that a priori and a posteriori probabilities can be
easily and automatically calculated from a database.
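Accuracy and coverage, together with the Bayesian relation between them, can be computed directly from a decision table. The Python sketch below uses a small hypothetical table (the attribute names, values and decisions are invented for illustration) and exact fractions so that the identity $\kappa_R(D) P(D) = \alpha_R(D) P(R)$ can be checked without rounding.

```python
from fractions import Fraction

# Hypothetical toy decision table: (attribute values, decision)
table = [
    ({"jolt": "no", "M1": "yes"}, "m.c.h."),
    ({"jolt": "no", "M1": "yes"}, "m.c.h."),
    ({"jolt": "no", "M1": "no"}, "m.c.h."),
    ({"jolt": "no", "M1": "no"}, "m.c.h."),
    ({"jolt": "no", "M1": "yes"}, "d.c.s."),
    ({"jolt": "yes", "M1": "no"}, "migraine"),
]


def meaning(attr, value):
    """Indices of the objects satisfying the descriptor [attr = value]."""
    return {i for i, (row, _) in enumerate(table) if row[attr] == value}


def decision_class(d):
    """Indices of the objects whose decision is d."""
    return {i for i, (_, dec) in enumerate(table) if dec == d}


def accuracy(r, d):
    """alpha_R(D) = |R_A & D| / |R_A| = P(D | R)."""
    return Fraction(len(r & d), len(r))


def coverage(r, d):
    """kappa_R(D) = |R_A & D| / |D| = P(R | D)."""
    return Fraction(len(r & d), len(d))


R = meaning("M1", "yes")
D = decision_class("m.c.h.")
n = len(table)
alpha, kappa = accuracy(R, D), coverage(R, D)
p_r, p_d = Fraction(len(R), n), Fraction(len(D), n)
```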

3.2 Definition of Characterization Set 

In order to model these three reasoning types, a statistical measure, the coverage
$\kappa_R(D)$, plays an important role; it is the conditional probability
of a condition (R) under the decision D ($P(R|D)$).

Let us define a characterization set of D, denoted by $L_{\delta_\kappa}(D)$, as a set each
element of which is an elementary attribute-value pair R with coverage
no smaller than a given threshold $\delta_\kappa$. That is,

\[ L_{\delta_\kappa}(D) = \{ [a_i = v_j] \mid \kappa_{[a_i = v_j]}(D) \ge \delta_\kappa \}. \]

^2 These characteristics follow from the formal definition of accuracy and coverage. In this
paper, these measures are important not only from the viewpoint of propositional
logic, but also from that of modelling medical experts' reasoning, as shown later.
Then, according to the descriptions in Section 2, three models of reasoning about
complications are defined as below:

1. Independent type: $L_{\delta_\kappa}(D_i) \cap L_{\delta_\kappa}(D_j) = \emptyset$,

2. Boundary type: $L_{\delta_\kappa}(D_i) \cap L_{\delta_\kappa}(D_j) \neq \emptyset$, and

3. Subcategory type: $L_{\delta_\kappa}(D_i) \subseteq L_{\delta_\kappa}(D_j)$.

These three definitions correspond to the negative region, boundary region,
and positive region [8], respectively, if the set of all elementary attribute-
value pairs is taken as the universe of discourse. Thus, reasoning about
complications is closely related to the fundamental concepts of rough set
theory and approximate reasoning [16].
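A characterization set and the three relation types can be computed mechanically from a decision table. The following Python sketch assumes a hypothetical table of (attribute values, decision) pairs and illustrative function names; since a subcategory relation also has a non-empty intersection, the sketch tests for the subcategory type before reporting the boundary type.

```python
def characterization_set(table, decision, threshold=1.0):
    """Descriptors (attr, value) whose coverage for `decision` is >= threshold."""
    members = [row for row, dec in table if dec == decision]
    descriptors = {(a, v) for row in members for a, v in row.items()}
    return {
        (a, v)
        for a, v in descriptors
        if sum(row[a] == v for row in members) / len(members) >= threshold
    }


def relation_type(l_i, l_j):
    """Classify the relation between two characterization sets."""
    if not (l_i & l_j):
        return "independent"
    if l_i <= l_j:
        return "subcategory"
    return "boundary"


# Hypothetical decision table used only for illustration.
table = [
    ({"jolt": "no", "M1": "yes", "B1": "no"}, "m.c.h."),
    ({"jolt": "no", "M1": "yes", "B1": "no"}, "m.c.h."),
    ({"jolt": "no", "M1": "yes", "B1": "yes"}, "d.c.s."),
    ({"jolt": "yes", "M1": "no", "B1": "no"}, "migraine"),
]
l_mch = characterization_set(table, "m.c.h.")
l_dcs = characterization_set(table, "d.c.s.")
l_mig = characterization_set(table, "migraine")
```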



3.3 Characterization as Exclusive Rules 

The characteristics of a characterization set depend on the value of $\delta_\kappa$. If the threshold
is set to 1.0, then a characterization set is equivalent to a set of attributes in
exclusive rules [13]. That is, the meaning of each attribute-value pair in $L_{1.0}(D)$
covers all the examples of D. In other words, an example which fails to
satisfy some pair in $L_{1.0}(D)$ does not belong to class D.

Construction of rules based on $L_{1.0}$ is discussed in Subsection 4.2, and
can also be found in [14, 15]. The differences between these two papers are the
following: in the former paper, the independent type and subcategory type for $L_{1.0}$
are used to represent diagnostic rules and applied to the discovery of decision
rules in medical databases. In the latter paper, the boundary
type for $L_{1.0}$ is used and applied to the discovery of plausible rules.

3.4 Characterization in Diagnostic Reasoning 

Let us return to the example in Section 2. The two rules are represented as:

\[ a_1 \wedge A_2 \wedge \neg A_3 \to \text{muscle contraction headache} \]
\[ a_1 \wedge A_2 \wedge A_3 \to \text{disease of cervical spine} \]

From the viewpoint of characterization sets, $a_1$ and the members of $A_2$ should be
included in the characterization sets of both classes. On the other hand, the members
of $A_3$ are included only in that of muscle contraction headache.^3 That is,

\[ a_1 \in L_{1.0}(m.c.h.), \quad A_2 \subseteq L_{1.0}(m.c.h.), \]
\[ a_1 \in L_{1.0}(d.c.s.), \quad A_2 \subseteq L_{1.0}(d.c.s.), \]
\[ A_3 \subseteq L_{1.0}(m.c.h.), \quad A_3 \not\subseteq L_{1.0}(d.c.s.), \]

^3 For simplicity, the threshold $\delta_\kappa$ is set to 1.0 in the following discussion because $\delta_\kappa$ =
1.0 corresponds to exclusive reasoning (characterization) [13, 14]. However, it is easy
to show that this condition is not actually necessary for the discussion. It is only
needed for interpretation from the medical side.
where m.c.h. and d.c.s. denote muscle contraction headache and disease of
cervical spine, respectively. These facts are summarized into:

\[ \{a_1, A_2\} \subseteq L_{1.0}(m.c.h.) \cap L_{1.0}(d.c.s.), \]
\[ A_3 \subseteq L_{1.0}(m.c.h.) - L_{1.0}(d.c.s.). \]

Thus, the relation between the characterization sets of m.c.h. and d.c.s. is of boundary
type, and the difference set between these two characterization sets is important
for discriminating between these two diseases. From the above discussion, it is easy
to see that there are two types of attribute-value pairs: the first type describes
the characteristics shared by these two diseases and the second describes
the discrimination between them. In this way, discrimination can be viewed as
the use of attribute-value pairs belonging to the difference set between decision
attributes (target classes). That is,

Definition 2 (Discriminating Descriptors).

Let $D_1$ and $D_2$ denote decision attributes and let $\Delta(D_1, D_2) = L_{1.0}(D_1) -
L_{1.0}(D_2)$. Formulae in $\Delta(D_1, D_2)$ are called discriminating descriptors. On the
other hand, formulae in $L_{1.0}(D_1) \cap L_{1.0}(D_2)$ are called indistinguishable
descriptors.

Finding rules that discriminate between $D_1$ and $D_2$ is equivalent to finding the
discriminating descriptors of $L_{1.0}(D_1)$ and $L_{1.0}(D_2)$.

For the example shown in Section 2, $\Delta(m.c.h., d.c.s.)$ is equal to $A_3$. It is
notable that the domain for the characterization sets, in other words the selection
of attributes as the domain, is very important for classifying the type of relation between
characterization sets. If we change the domain for the characterization sets, then
they will give a different view of each decision attribute. For example, if we
select $\{a_1, A_2\}$ as the domain, we cannot distinguish $L_{1.0}(m.c.h.)$ and $L_{1.0}(d.c.s.)$
(subcategory type). On the other hand, if we select $A_3$ as the domain, we can
distinguish between these two sets. To evaluate the nature of a characterization,
we can define an index of discrimination power.

Definition 3 (Discrimination Power). Let $D_1$ and $D_2$ denote two decision
attributes and let $\mathcal{D}$ denote the set of all descriptors. The discrimination power of $\mathcal{D}$ for
$D_1$ and $D_2$ is defined as:

\[ \frac{|\Delta(D_1, D_2)|}{|\mathcal{D}|}. \]
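Definitions 2 and 3 translate directly into set operations. In the Python sketch below, the descriptor names and the two characterization sets are hypothetical stand-ins for the m.c.h. and d.c.s. example; only the set difference, the intersection and the ratio come from the definitions.

```python
from fractions import Fraction


def discriminating(l_1, l_2):
    """Delta(D1, D2) = L(D1) - L(D2): the discriminating descriptors."""
    return l_1 - l_2


def indistinguishable(l_1, l_2):
    """L(D1) & L(D2): descriptors shared by both classes."""
    return l_1 & l_2


def discrimination_power(l_1, l_2, all_descriptors):
    """|Delta(D1, D2)| divided by the number of all descriptors."""
    return Fraction(len(discriminating(l_1, l_2)), len(all_descriptors))


# Hypothetical descriptor names standing in for attribute-value pairs.
all_descriptors = {"a1", "m1", "b1_no", "c1_no", "b1_yes", "c1_yes", "jolt_yes", "m1_no"}
l_mch = {"a1", "m1", "b1_no", "c1_no"}
l_dcs = {"a1", "m1"}
power = discrimination_power(l_mch, l_dcs, all_descriptors)
```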

4 Rule Induction based on Medical Diagnosis 

4.1 Modelling Medical Diagnosis 

As shown in the above subsection, if the set of discriminating descriptors ($\Delta(D_1, D_2)$)
is equal to the selected domain ($\mathcal{D}_s$), then the corresponding characterization sets are
independent. If $\Delta(D_1, D_2)$ is empty, then $L_{1.0}(D_1)$ and $L_{1.0}(D_2)$ are of the
subcategory type.

In the case of the subcategory type, we can group these decision attributes into
more generalized attributes. For example, if we take $a_1 \wedge A_2$ as $\mathcal{D}_s$, then m.c.h.
and d.c.s. can be grouped into one decision attribute, say m.c.h._and_d.c.s. Thus,
the medical differential diagnosis process can be modeled as follows.

1. Detect Subcategory and Independent type (to group several $d_i$). For each
subcategory case, make a generalized decision attribute ($D_i$).

2. From observations of a case, check whether this case belongs to a group $D_i$.

3. Then, discrimination between $D_i$ and $D_j$ will be applied.

4.2 Rule Induction Method 

Based on the above medical diagnosis model, we obtain the following algorithm
for rule induction with grouping.

1. Generate Characterization Sets: $L_{1.0}(d_i)$

2. Detect Subcategory and Independent type

3. Grouping Decision Attributes ($g_i$) of Subcategory type: make a partition of
grouped attributes.

4. Apply Rule Induction to $g_i$: $C_i \to g_i$, which discriminate between grouped
attributes

5. Apply Rule Induction within $g_i$: $g_i \wedge C_j \to d_k$

6. Integrate Rules: $(C_i \to g_i) + (g_i \wedge C_j \to d_k)$

This algorithm was first introduced in [14] and extended to the probabilistic case ($\delta_\kappa <
1.0$). In both papers, these algorithms were evaluated on three medical databases,
the experimental results of which show that these algorithms generate rules more
similar to medical experts' rules than the conventional rule induction methods.
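Steps 1 to 3 of the algorithm amount to partitioning the decision classes by the relation between their characterization sets. The Python sketch below shows one possible greedy grouping, assuming the characterization sets are already available and restricted to a selected domain; the grouping rule, the function name and the data are illustrative assumptions rather than the procedure of [14].

```python
def group_by_subcategory(char_sets, domain):
    """Group classes whose characterization sets, restricted to `domain`, are nested."""
    groups = []
    for name, l in char_sets.items():
        restricted = l & domain
        for group in groups:
            rep = char_sets[group[0]] & domain  # first member represents the group
            if restricted and rep and (restricted <= rep or rep <= restricted):
                group.append(name)
                break
        else:
            groups.append([name])  # no suitable group found: open a new one
    return groups


# Hypothetical characterization sets with descriptor names as strings.
char_sets = {
    "m.c.h.": {"a1", "m1", "b1_no"},
    "d.c.s.": {"a1", "m1", "b1_yes"},
    "migraine": {"jolt_yes"},
}
groups = group_by_subcategory(char_sets, domain={"a1", "m1"})
```

Rule induction would then be applied first between the groups (step 4) and afterwards within each group (step 5).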



5 Conclusions 

In existing studies, diagnostic reasoning has been modeled as if-then rules.
However, closer examination suggests that medical diagnostic
reasoning should consist of multiple strategies, in which one of the most important
characteristics is that domain experts change the granularity of rules in a flexible
way. In this paper, we focus on the characteristics of this medical reasoning from
the viewpoint of granular computing and formulate the strategy of switching
the information granules. Furthermore, using the proposed model, we introduce
an algorithm which induces if-then rules with a given level of granularity. This
paper is a preliminary study on the application of granular computing methods to
medical diagnosis. Further formal studies will be presented in the near future.
Abstract. We discuss how the Case Based Reasoning (CBR) (see e.g. [1],
[4]) philosophy of adapting some known situations to new similar
ones can be realized in the rough set framework [5] for complex hierarchical
objects.

We discuss how various problems can be represented by means of
complex objects described by hierarchical attributes, and how to use the
similarity between them for predicting the relevant algorithms corresponding to
these objects. The complex object attributes are of different types: basic
attributes related to the problem definition (e.g. features of object parts),
attributes reflecting some additional characteristic of the problem (e.g.
features of more complex objects inferred from properties of their parts and
their relations), and attributes representing algorithm structures (e.g.
order and/or properties of operations used to solve the given problem).
We show how to define these particular attribute sets, and how to
recognize the similarity of objects in order to transform algorithms
corresponding to these objects into a new algorithm relevant for the new
incompletely defined object [1, 4].

Object similarity is defined on several levels: the basic attribute recognition
level, the characteristic attribute recognition level and the algorithm operation
recognition level. Dependencies between attributes are used to link
different levels. These dependencies can be extracted from data tables
specifying the links.

We discuss how to classify new objects, and how to synthesize an algorithm
for such a new object on the basis of algorithms corresponding to similar
objects. The main problem is the generation of rules that make it possible to create
operation sequences for a new algorithm. These rules are generated using
the rough set approach [5].



1 Problem Representation as a Complex Object

We discuss methods of solving various problems that can be specified by
means of hierarchical attributes. Examples of such problems are simple
mathematical tasks or finding the way out of a maze.

The problem-case is represented as a complex object (constructed
hierarchically) defined by some attributes and its information signature
(attribute value vector). The information signature of any object $O \in U$
is defined by the attribute value vector, i.e.,
$Inf_A(O) = \{(a, a(O)) : a \in A\}$, where $a(O)$ is the value of attribute
$a$ on the object $O$ [5].



W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 503-510, 2001. 
Objects described by the same attributes are represented in an information
system [5]. A complex object can be described as an undecomposed one, using
some general attributes (e.g. specifying the general object type, such as
geometrical figure or maze, and its main components, such as two triangles
or a square maze) (Table 1a), or it can be described by more detailed
attributes after decomposing it up to some level (Table 1b). Values of
attributes are binary (0, 1): 1 means that the attribute value is present
(e.g. crossings in the maze are perpendicular) or that its real value (which
can be taken from the content of the task) is known (e.g. the square edge
length is known); 0 means that the attribute value is not present or its
real value is unknown.

Some attribute values of a complex object point to its subobjects (specified
in other information systems) and specify how they are related (Table 1b).

Among object attributes we distinguish attributes, called algorithm
(operation) attributes, whose values indicate whether the operation is
enabled. Such enabled attributes fire the corresponding operations
transforming the object, and make it possible to solve a given problem or to
make some decisions.



Table 1a
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are general attributes describing the object, and $(op_1, \ldots, op_m)$ are
algorithm attributes (operations) transforming the object.

In Table 1b several subsets of the complex object attribute set are
distinguished:

(i) problem definition attributes (basic attributes): $(a_1, \ldots, a_n)$,
$(b_1, \ldots, b_n)$,

(ii) relational attributes, declaring relations between component objects of
the complex object: $(r_1, \ldots, r_n)$,

(iii) problem additional characteristics attributes: $(c_1, \ldots, c_k)$,

(iv) algorithm operations transforming the object (the given problem):
$(op_1, \ldots, op_m)$.

By solving the problem we mean finding, for a given object, all values of
the attributes corresponding to operations to be executed to solve the
problem (e.g. the way out of the maze is found, i.e. the decision answering
the question whether we are at the maze exit is true), or finding the value
vector of some other missing object attributes, e.g. finding the area of a
geometrical figure.

We solve the given problem by computing, step by step, the algorithm
operations whose value is 1 (true), in an order specified by the hierarchical
structure of tables. This lets us find the required values of object
attributes. To execute a particular operation, we need the values of some
other object attributes. Some of them come from the main information system,
but some must be taken from other information systems for properly chosen
subobjects (Fig. 1). We know the attribute values for $a, b, A, D$, but we
are looking for the attribute values of $c, d$.
Algorithm operations $op_1(a, b) = c$; $op_2(C, F, c) = d$;
$op_3(D, E) = F$; $op_4(A, B) = C$ can be represented by dependencies [5]:
$ab \to c$; $CFc \to d$; $DE \to F$; $AB \to C$, which in turn yield the
dependency $ABDEab \to d$, represented graphically in Fig. 2.

One may notice that to execute the main object algorithm operation $op_2$ we
need the attribute values of $c$, $C$, $F$, which are obtained from
operations $op_1$, $op_3$, $op_4$. Operation $op_1$ is a main object
algorithm operation, while operations $op_3$, $op_4$ are subobject algorithm
operations.

Each complex object (from an information system) representing the
problem-case is represented in a hierarchical system (Fig. 3).

The components of such a system are information systems [5] related in a
specific order. Component objects are complex objects (defined
hierarchically) or simple objects (defined by primary attributes, i.e.
primary concepts of the domain in which the problem is included). Basic
attributes are taken from the problem definition (content); others specify
(map) subobjects, i.e. component objects from other information systems
(Table 2, Table 3). They are formulated (by the expert) in a way that makes
it possible to specify some attributes and their value vectors from the sets
of basic, relational and characteristic attributes for the subobject [2].
Thus we may obtain a particular subobject (or subobjects) of the complex
object.
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Table 3 - (objects of type B) 
Relational attributes declare relations between the component objects from
which the complex object is constructed. Characteristic attributes are
defined by an expert or by system rules, and provide an extra description of
the complex object, which lets the system apply the proper object
transformation rules. Algorithm operations are attributes transforming the
object in order to find the missing value vector of some attribute [2].
From each information system one may derive rules defining algorithm
attributes (operations) using the basic, relational and characteristic
object attributes. The rules are of the form $\alpha \to op(O)$ for some
$op \in OP$, where $\alpha$ is a conjunction of descriptors $(a, v)$ for some
$a \in A$ and $v \in V_a$. More than one operation attribute can appear on
the right-hand side of a rule.

To execute an algorithm operation for the main complex object we may need
values obtained by subobject algorithm operations. To specify such a
situation we introduce a special information table (Table 4), consisting of
the necessary additional information from lower levels of the hierarchical
structure.
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(op4**Pi^n) operations for subobject algorithm and (opi,.., 

opk) are operations of main object algorithm; and (tci,..,tc4 some additional 
attributes specifying hierarchical structure of main algorithm operations. 



Example 1. Let us consider a simple geometrical figure (or maze) (Fig. 4).
This figure represents a hierarchical system. We may decompose such a figure
into component objects (Fig. 5). Such an object decomposition can be done in
different ways (Fig. 5a). The most difficult problem is to define (find) the
proper decomposition method for the case. A component object may itself be a
complex object, and can be decomposed into its component objects, complex or
simple (Fig. 6). Component objects of a complex object are related by
relations defined by relational attributes, e.g. describing how such
component objects are placed with respect to one another.
Each component object is labeled by an algorithm transforming it, e.g. one
that makes it possible to obtain some of its attributes from other
attributes or components (Fig. 7).

Each component object can be related to another component object of a
different complex object. Such a relation lets us transform this object into
the other one.



2 Similarity (Closeness) of Objects 

One of the main problems to be solved is to construct relevant similarity
measures between complex objects. The similarity of objects is relevant if,
on the basis of it, we may specify a proper set of algorithm attribute values
(operations) for the new object, on the basis of known objects at some level.
To do this, complex objects must be decomposed up to some level. We compare
complex objects (from the same or different information systems) by checking
the consistency of attributes and their value vectors. We start checking the
similarity of such objects by comparing attributes and their value vectors
for the main complex objects and next by comparing attributes and their
value vectors for their component objects. The number of concordant
attributes and their values for the main complex objects and their component
objects defines the level of similarity, and thus some relations on the
basis of which one may define similarity.
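
Counting concordant attributes level by level can be illustrated with a
short Python sketch. Everything here is an assumption made for illustration:
the dictionary layout (`attrs`, `parts`), the positional pairing of component
objects, and the attribute names in the sample figures are hypothetical and
are not taken from the paper's tables.

```python
def concordance(x, y):
    """Count attributes present in both signatures that carry the same value."""
    return sum(1 for a in x.keys() & y.keys() if x[a] == y[a])


def hierarchical_similarity(obj1, obj2):
    """Return (main-level count, component-level count) of concordant attributes.

    Each object is a dict with 'attrs' (attribute -> 0/1) and 'parts'
    (a list of component objects of the same shape). Components are paired
    by position, which is a simplifying assumption of this sketch.
    """
    top = concordance(obj1["attrs"], obj2["attrs"])
    lower = sum(
        sum(hierarchical_similarity(p, q))
        for p, q in zip(obj1["parts"], obj2["parts"])
    )
    return top, lower


# Hypothetical binary descriptions of two figures, each built from two parts.
figure_a = {
    "attrs": {"is_figure": 1, "two_triangles": 1},
    "parts": [
        {"attrs": {"edge_known": 1, "height_known": 0}, "parts": []},
        {"attrs": {"edge_known": 1, "height_known": 1}, "parts": []},
    ],
}
figure_b = {
    "attrs": {"is_figure": 1, "two_triangles": 0},
    "parts": [
        {"attrs": {"edge_known": 1, "height_known": 0}, "parts": []},
        {"attrs": {"edge_known": 0, "height_known": 1}, "parts": []},
    ],
}

print(hierarchical_similarity(figure_a, figure_b))  # (1, 3)
```

The pair of counts keeps the general (main object) level separate from the
more detailed (component) level, so a threshold could be applied to each
level independently.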

Let us consider Example 1. Complex objects are decomposed (Fig. 4-Fig. 6) in
order to find the common set of object attributes which can be measured. If
such a decomposition is proper, we may define, on the basis of concordant
attributes and their value vectors, relations of similarity degree of
objects. We measure similarity on several levels. First it is the similarity
of objects from the main information system (Fig. 4), next it is the
similarity of subobjects (from sub-information systems) (Fig. 5, Fig. 6).
Similarity of main objects is more general, while the similarity definition
for subobjects is more detailed. If the main complex objects are not
sufficiently similar on high-level attributes, we may try to define their
similarity in a more detailed way, by taking into account their subobjects
and the similarity between them.

Objects from different information systems are described by different
attributes and their value vectors; that is why we may define two types of
similarity relation:

- consistency of attributes, and consistency of their value vectors,

- consistency of attributes, and inconsistency of their value vectors.

Objects from the same information system are described by the same
attributes but with different value vectors. For such objects there is one
relation type: consistency or inconsistency of their value vectors.

One object can turn out to be similar to several other objects under
different similarity relations, depending on the number of consistent
attributes and their value vectors and on the level of such consistency.

For any given object $O$, a set $\mathcal{O}$ of objects similar to $O$ is
extracted from the hierarchical information system. Algorithms corresponding
to objects from $\mathcal{O}$ should determine a proper algorithm for $O$.
This is done by a learning procedure. However, some simple heuristics based
on similarity of objects can also be used (Fig. 8), which will be discussed
later.



$\mathcal{O} = \{O, O_1, \ldots, O_5\}$ is the set of objects similar to $O$,
and $A_1, \ldots, A_5$ are algorithms transforming objects
$O_1, \ldots, O_5$.

Example 2. If on the top level $O \, Sim_{top} \, O'$ (where $Sim_{top}$
denotes similarity on the top level) and the value of an attribute, e.g.
Area, for $O$ has been computed by decomposing $O$ into $O_1$ and $O_2$ and
adding $Area_1(O_1)$ and $Area_2(O_2)$, then we first try to find a
decomposition of $O'$ into $O'_1$ and $O'_2$, and next compute its area
using the same operation of addition (Fig. 9). In some cases, for $O'_1$,
$O'_2$ we search on the corresponding levels of decomposition for similar
objects and we follow the decomposition procedure for them.

3 New Object and Its Decomposition 

Let us consider a new complex object. Now the main problem is how to
construct a hierarchical structure of component objects for the new object
by taking into account its similarity to other objects and the relations
between the extracted similar objects on the basis of their attributes. The
new object is matched against complex objects represented in our knowledge
base. The main goal of the general strategy is to isolate possible component
objects by using similarity to other component objects and the relations
between them. In this way, objects similar to the given new object to a
satisfactory degree are extracted. On the basis of the extracted complex
objects the proper operations of algorithms are selected: attributes of the
extracted objects and rules for the information system to which the new
object is assigned point out the operations of the algorithm transforming
the new object.

Here one faces the problem that not all attribute values of the new object
are known. Let us consider the rule
$\{(a, a(O)), (b, b(O)), (c, c(O))\} \to op(O)$. For a new object, only
$(a, a(O))$ and $(b, b(O))$ may be known. The question is whether the system
should start operation $op(O)$ or not. We may try to solve this problem by
taking into account some attributes or rules for the sub-information systems
for component objects. Considering sub-information systems, one may find the
missing attribute $(c, c(O))$, which makes it possible to start the operation
$op(O)$, or just skip this operation and successfully execute the following
operation of the main object algorithm. In some problems missing attributes
can be skipped, e.g. when going through a maze we may try to find another
way, but in other problems a missing attribute cannot be skipped easily,
e.g. in mathematical tasks we have to find a particular attribute value
vector in order to solve the whole problem. That is why for some objects we
have to declare a very precise algorithm, with all attributes defined
precisely (e.g. mathematical tasks), but for other objects we may declare a
more general algorithm with more general attributes (e.g. the maze problem),
which can be modified during its execution.
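
The decision described above (fire the operation, look the missing attribute
up in a sub-information system, skip, or stop) can be sketched in Python as
follows. This is a hedged illustration rather than the author's procedure:
the function `try_operation`, its parameters, and the four return labels are
hypothetical names introduced here.

```python
def try_operation(rule_conditions, known, sub_lookup, skippable):
    """Decide what to do with a rule {(a, v), ...} -> op for a partly known object.

    rule_conditions: dict attribute -> value required by the rule
    known:           dict attribute -> known value of the new object
    sub_lookup:      callable returning the value of a missing attribute found
                     in a sub-information system, or None if it is not found
    skippable:       True for loosely structured problems (e.g. a maze)
    Returns 'fire', 'no_match', 'skip' or 'blocked'.
    """
    values = dict(known)
    for attr in rule_conditions:
        if attr not in values:
            found = sub_lookup(attr)
            if found is None:
                # Nothing found on lower levels: skip if the problem allows it.
                return "skip" if skippable else "blocked"
            values[attr] = found
    matches = all(values[a] == v for a, v in rule_conditions.items())
    return "fire" if matches else "no_match"


rule = {"a": 1, "b": 1, "c": 1}
new_object = {"a": 1, "b": 1}  # the value of c is unknown

print(try_operation(rule, new_object, lambda attr: 1, skippable=False))     # fire
print(try_operation(rule, new_object, lambda attr: None, skippable=True))   # skip
print(try_operation(rule, new_object, lambda attr: None, skippable=False))  # blocked
```

The `skippable` flag stands in for the distinction between strictly
structured tasks (mathematical problems) and loosely structured ones (mazes).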

Example 3. We are looking for an object similar to the new one. This step in
the CBR cycle [1] is called Retrieve. Let us consider a new complex object: a
new geometrical figure or maze. To assign a proper new algorithm to such an
object we have to decompose it up to some level. First we try to find the
most similar main information system for the new object, taking into account
general problem definition attributes (basic attributes) and general
relational attributes, which declare general relations between the component
objects of the complex object. If some algorithm operations transforming the
new object are known (e.g. we have some experience in going through the new
maze, having passed some of its corridors successfully), we take them into
account as well when the most similar information system for the new object
is assigned. Next we try to identify the most similar object or objects from
the chosen information system, taking into account basic attributes,
detailed relational attributes, problem additional characteristics
attributes and some known algorithm operation attributes. The object or
objects with the maximum number of attributes and value vectors consistent
with the attributes and value vectors of the new object are chosen. Once the
most similar object or objects for the new object are extracted, we specify
the component objects (by their attributes) of the complex object. To do
this we consider information systems for the component objects. In this way
the most similar complex object (or objects) is chosen for the new object.
Known attributes of the new object and rules (obtained from the information
system to which the new object is assigned) defining object algorithm
attributes let us obtain some algorithm attributes (operations) for the new
object (Fig. 8, 9).

It can happen that some operations chosen for the new object do not return
the expected attribute values. In such a case these “wrong” algorithm
operations must be corrected and new missing operations specified. These
steps in the CBR cycle are called Reuse and Revise.

Let us consider the new algorithm $Alg_n$ (defined by the rules) for the new
object, $Alg_n = (op_{n1}, op_{n2}, \ldots, op_{nm})$. We start to perform
the operations $op_{n1}, \ldots, op_{nm}$. To execute some of the operations
we may need values obtained from a previous operation or operations. If such
a needed value is missing, the next operation cannot be executed. In such a
situation we try to find the missing value (e.g. the edge length of some
geometrical figure). To do this we consider another information system (a
sub-information system) (Table 2 or 3) for a component object of the complex
object (we perform the algorithm for the component object from which we try
to get the missing value). If the missing value is obtained, we perform the
next operations of the main algorithm; if not, we must modify some
operations of the main algorithm.

For a given object with a strict structure, e.g. some mathematical tasks,
executing the algorithm corresponding to this object requires performing all
operations, and that is why we need all attribute value vectors. If some
values are missing and we cannot obtain them from sub-information systems,
we cannot execute the operation. For some other objects, e.g. a maze, if
some attribute value vectors needed to execute an algorithm operation are
missing, we may try to skip such an operation and execute another one. To do
so, we must sometimes return to some already executed operations, and next
perform some other operations. For example, if we cannot pass the chosen
corridor in the maze, we must go back to the corridor crossing and choose
another corridor. In this way we correct the wrong algorithm operations.
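
Going back to a crossing and choosing another corridor is the classic
backtracking pattern. The sketch below illustrates it with a depth-first
search in Python; the graph encoding, the function name `find_way_out` and
the sample maze are assumptions made for this illustration and do not come
from the paper.

```python
def find_way_out(corridors, start, exit_point):
    """Depth-first search with backtracking over a maze graph.

    corridors: dict crossing -> list of crossings reachable by one corridor
    Returns the list of crossings from start to exit_point, or None.
    """
    path = [start]
    visited = {start}

    def advance(crossing):
        if crossing == exit_point:
            return True
        for nxt in corridors.get(crossing, []):
            if nxt in visited:
                continue
            visited.add(nxt)
            path.append(nxt)
            if advance(nxt):
                return True
            path.pop()  # dead end: return to the crossing, try another corridor
        return False

    return path if advance(start) else None


# Hypothetical maze: the corridor S-A-D is a dead end, S-B-C-E reaches the exit.
maze = {"S": ["A", "B"], "A": ["D"], "D": [], "B": ["C"], "C": ["E"], "E": []}
print(find_way_out(maze, "S", "E"))  # ['S', 'B', 'C', 'E']
```

Here each `path.pop()` corresponds to withdrawing an already executed
operation before a different one is tried.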
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4 Conclusions 

We have presented the main idea behind a software system for problem solving
that is currently under development.

We have discussed how to represent the various problems by means of com- 
plex objects represented by some hierarchical attributes, and how to use similar- 
ity between them for predicting the relevant algorithms corresponding to these 
objects. 

The most difficult problem is the proper decomposition of the new complex
object. On the basis of object attributes and their value vectors we may
predict the similarity of the new object to the known ones. The level of
such similarity determines the chances of developing a proper algorithm for
the new object. Here we may distinguish three categories of new objects:
those which are similar to the known objects to a satisfactory degree, to a
partially satisfactory degree, and to an unsatisfactory degree. Any object
similar to a satisfactory degree to the known objects allows a correct
algorithm to be constructed. For an object similar to a partially
satisfactory degree there is only a chance of constructing a correct
algorithm. Finally, for objects similar to an unsatisfactory degree it is
not possible to construct a correct algorithm.

We have distinguished at least two types of objects, those which are specified 
by some precise attributes - e.g. in case of mathematical tasks, and those which 
are specified by less precise attributes - e.g. in case of maze problems. For these 
two types different types of algorithms must be created. 

We have outlined methods of retrieving similar objects for the new case, and 
reusing known algorithms for new objects using ideas of CBR cycle [1]. 

Acknowledgment. The author is due to thank Professor Andrzej Skowron for 
Abstract. Rough sets have traditionally been applied to decision
(classification) problems. We suggest that rough sets are even better suited
for reasoning. It has already been shown that rough sets can be applied 
for reasoning about knowledge. In this preliminary paper, we show how 
rough sets provide a convenient framework for uncertainty reasoning. 
This discussion not only presents a new topic for future research, but 
further demonstrates the flexibility of rough sets. 



1 Introduction 

The theory of rough sets [4, 5, 9, 10] generalizes traditional set theory by allowing 
a concept to be described approximately by a lower and upper bound. Although 
rough sets have been extensively studied, most of these investigations demon- 
strated the usefulness of rough sets in decision (classification) problems. Wong [8] 
first demonstrated that rough sets can also be applied for reasoning about knowl- 
edge. This observation was also made later by Salonen and Nurmi [6]. 

In this preliminary paper, we extend the work in [6, 8] by demonstrating
that rough sets can also be applied for uncertainty management. In [6, 8],
rough sets are used as a framework to represent formulas such as “player 1
knows $\varphi$”. By incorporating probability, we can now represent
sentences such as “the probability of $\varphi$, according to player 1, is
at least $\alpha$”, where $\varphi$ is a formula and $\alpha$ is a real
number in [0, 1]. Thereby, not only does this discussion present a new topic
for future research, but it further demonstrates the flexibility of rough
sets.

The remainder of this paper is organized as follows. Kripke semantics for 
modal logic are given in Section 2. The key relationships between rough sets 
and the Kripke semantics for modal logic are stated in Section 3. In Section 4, 
probability is incorporated into the logical framework. In Section 5, we demon- 
strate that rough sets are also a useful framework for uncertainty reasoning. The 
conclusion is given in Section 6. 

2 Kripke semantics for Modal Logic 

Consider an ordered pair $\langle W, R \rangle$ consisting of a nonempty set
$W$ of possible worlds and a binary relation $R$ on $W$. Let $Q$ denote the
set of sentence letters (primitive propositions). An evaluator function $f$:

$f : W \times Q \to \{\top, \bot\}$,

assigns a truth-value, $\top$ or $\bot$, to each ordered pair $(w, q)$, where
$w \in W$ is a possible world and $q \in Q$ is a sentence letter. We call the
triple $M = \langle W, R, f \rangle$ a model (structure), and $R$ a
possibility (accessibility) relation.

The function of an evaluator is to determine which primitive proposition $q$
is to be true at which world $w$ in a model. We write $(M, w) \models q$ if
$f(w, q) = \top$. We can now define what it means for a proposition (formula)
$\varphi$ to be true at a given world in a model by assuming that $\models$
has been defined for all subformulas of $\varphi$. That is, for all
propositions $\varphi$ and $\psi$,

$(M, w) \models \varphi \wedge \psi$ iff $(M, w) \models \varphi$ and $(M, w) \models \psi$,

$(M, w) \models \neg\varphi$ iff $(M, w) \not\models \varphi$,

and

$(M, w) \models \Box\varphi$ iff $(M, x) \models \varphi$ for all $x$ such that $(w, x) \in R$.

The above definition enables us to infer inductively the truth-value, i.e.,
$(M, s) \models \varphi$, of all other propositions from those of the
primitive propositions. We say “$\varphi$ is true at $(M, s)$” or “$\varphi$
holds at $(M, s)$” or “$(M, s)$ satisfies $\varphi$”, if
$(M, s) \models \varphi$.

In order to establish a connection with rough set theory, we review the
notion of an incidence mapping [1], denoted by $I$. To every proposition
$\varphi$, we can assign a set of worlds $I(\varphi)$ defined by:

$I(\varphi) = \{w \in W \mid (M, w) \models \varphi\}$.

This function is used in establishing the relationship between a Kripke
structure and an Aumann structure in the recent work of Fagin et al. [2]. The
important point of this discussion is that the incidence mapping $I$ provides
a set-theoretic interpretation of Kripke semantics.
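
The inductive truth definition and the incidence mapping translate almost
directly into code. The following Python sketch assumes a representation
that is not in the paper: formulas are nested tuples, the evaluator $f$ is a
dictionary keyed by (world, letter), and the three-world model is a made-up
example.

```python
def holds(model, world, formula):
    """Evaluate a formula at a world of a Kripke model (W, R, f).

    Formulas are nested tuples: ('atom', q), ('not', p), ('and', p, r),
    ('box', p). This encoding is an assumption of the sketch.
    """
    _worlds, relation, valuation = model
    kind = formula[0]
    if kind == "atom":
        return valuation[(world, formula[1])]
    if kind == "not":
        return not holds(model, world, formula[1])
    if kind == "and":
        return holds(model, world, formula[1]) and holds(model, world, formula[2])
    if kind == "box":
        # True iff the subformula holds at every world accessible from `world`.
        return all(holds(model, x, formula[1]) for (w, x) in relation if w == world)
    raise ValueError(f"unknown connective: {kind}")


def incidence(model, formula):
    """I(formula): the set of worlds at which the formula is true."""
    return {w for w in model[0] if holds(model, w, formula)}


# Hypothetical model: R is an equivalence relation with classes {1, 2} and {3}.
WORLDS = {1, 2, 3}
RELATION = {(1, 1), (1, 2), (2, 1), (2, 2), (3, 3)}
VALUATION = {
    (1, "q"): True, (2, "q"): True, (3, "q"): False,
    (1, "r"): True, (2, "r"): False, (3, "r"): True,
}
MODEL = (WORLDS, RELATION, VALUATION)

print(sorted(incidence(MODEL, ("box", ("atom", "q")))))  # [1, 2]
print(sorted(incidence(MODEL, ("box", ("atom", "r")))))  # [3]
```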

3 Rough Sets versus Kripke semantics 

The original motive of rough sets [5] was to characterize a particular
concept (represented by a subset of a finite universe $W$ of interest) based
on the information (knowledge) on hand. This knowledge is represented by a
binary relation $R$ on $W$. Rough sets can be viewed as an extension of
ordinary sets, in which a set $A \subseteq W$ is described by a pair
$(\underline{A}, \overline{A})$ of subsets of $W$. Note that $\underline{A}$
and $\overline{A}$ are not necessarily distinct. For our exposition here, we
may assume that $R$ is an equivalence relation. In this case, rough sets are
defined by the following knowledge operator $K$: for all $A \subseteq W$,

$\underline{A} = K(A) = \{w \in W \mid [w]_R \subseteq A\}$,

and

$\overline{A} = \neg K(\neg A) = \{w \in W \mid [w]_R \cap A \neq \emptyset\}$,

where $[w]_R$ denotes the equivalence class of $R$ containing the element
$w \in W$. In the theory of rough sets, we call $\underline{A}$ the lower
approximation and $\overline{A}$ the upper approximation of $A$.
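
Both approximations are easy to compute when the equivalence relation is
given as a partition. The Python sketch below is illustrative only; the
function names and the six-element universe are assumptions, and the last
line checks the duality between the two operators on this one example.

```python
def lower_approximation(partition, subset):
    """K(A): union of the equivalence classes contained in A."""
    return {w for block in partition if block <= subset for w in block}


def upper_approximation(partition, subset):
    """Union of the equivalence classes that intersect A."""
    return {w for block in partition if block & subset for w in block}


# Hypothetical universe {1, ..., 6} partitioned into three classes.
universe = {1, 2, 3, 4, 5, 6}
partition = [{1, 2}, {3, 4}, {5, 6}]
A = {1, 2, 3}

print(sorted(lower_approximation(partition, A)))  # [1, 2]
print(sorted(upper_approximation(partition, A)))  # [1, 2, 3, 4]

# Duality: the upper approximation is the complement of K applied to the complement.
assert upper_approximation(partition, A) == universe - lower_approximation(
    partition, universe - A
)
```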

It was shown [7] that the Kripke semantic model is equivalent to the
characterization of modal propositions by a rough-set model. That is, each
proposition $\varphi \in L$ can be represented by a subset of possible worlds
and the modal operator $\Box$ by the knowledge operator $K$ defined above.
The key relationships between the Kripke semantic model and the rough-set
model are summarized as follows:

(i) $(M, w) \models \varphi$ iff $w \in I(\varphi)$,
(ii) $(M, w) \models \Box\varphi$ iff $w \in K(I(\varphi))$.

The above results enable us to adopt rough sets for reasoning about
knowledge instead of using the framework based on modal logic as suggested
by Fagin et al. [2].

We conclude this section with an example [2] to illustrate how the rough-set
model is used in reasoning. Consider a deck of cards consisting of three
cards labeled $X$, $Y$ and $Z$. Assume there are two players (agents), i.e.,
$G = \{1, 2\}$. Players 1 and 2 each get one of these cards. The third card
is left face down. We describe a possible world by the cards held by each
player. Clearly, there are six possible worlds, i.e.,
$W = \{(X, Y), (X, Z), (Y, X), (Y, Z), (Z, X), (Z, Y)\}
= \{w_1, w_2, w_3, w_4, w_5, w_6\}$. For example, $w_2 = (X, Z)$ says that
player 1 holds card $X$ and player 2 holds card $Z$. The third card $Y$ is
face down. We can easily construct the two partitions $\pi_1$ and $\pi_2$ of
$W$, which respectively represent the knowledge of the two players. For
example, $w_1 = (X, Y)$ and $w_2 = (X, Z)$ belong to the same block of
$\pi_1$ because in a world such as $w_1 = (X, Y)$, player 1 considers two
worlds possible, namely $w_1 = (X, Y)$ itself and $w_2 = (X, Z)$. That is,
when player 1 holds card $X$, he considers it possible that player 2 holds
card $Y$ or card $Z$. Similarly, in a world $w_1 = (X, Y)$, player 2
considers the two worlds $w_1 = (X, Y)$ and $w_6 = (Z, Y)$ possible, i.e.,
$w_1$ and $w_6$ belong to the same block of $\pi_2$. Based on this analysis,
one can easily verify that:

$\pi_1 = \{[w_1, w_2]_{1X}, [w_3, w_4]_{1Y}, [w_5, w_6]_{1Z}\}$,

$\pi_2 = \{[w_3, w_5]_{2X}, [w_1, w_6]_{2Y}, [w_2, w_4]_{2Z}\}$.

It is understood that in both worlds $w_1$ and $w_2$ of the block
$[w_1, w_2]_{1X}$ in $\pi_1$, player 1 holds card $X$; in both worlds $w_1$
and $w_6$ of the block $[w_1, w_6]_{2Y}$, player 2 holds card $Y$, and so on.
The corresponding equivalence relations $R_1$ and $R_2$ can be directly
inferred from $\pi_1$ and $\pi_2$. In this example, we have six primitive
propositions: $1X$ denotes the statement “player 1 holds card $X$”, $1Y$
denotes the statement “player 1 holds card $Y$”, ..., and $2Z$ denotes the
statement “player 2 holds card $Z$”. Each of these propositions is
represented by a set of possible worlds. By the definition of the mapping
$I$, we obtain:

$I(1X) = \{w_1, w_2\}$, $I(1Y) = \{w_3, w_4\}$, $I(1Z) = \{w_5, w_6\}$,

$I(2X) = \{w_3, w_5\}$, $I(2Y) = \{w_1, w_6\}$, $I(2Z) = \{w_2, w_4\}$.

Using these primitive representations, the representations of more complex
propositions can be easily derived from properties (i1)-(i5). For example,

$I(1X \wedge 2Y) = I(1X) \cap I(2Y)$
$= \{w_1, w_2\} \cap \{w_1, w_6\} = \{w_1\}$,

$I(2Y \vee 2Z) = I(2Y) \cup I(2Z)$
$= \{w_1, w_6\} \cup \{w_2, w_4\} = \{w_1, w_2, w_4, w_6\}$.

More interesting is the following expression, which indicates that if player
1 holds card $X$, then he knows that player 2 holds card $Y$ or card $Z$:

$I(\Box_1(2Y \vee 2Z)) = K_1(I(2Y \vee 2Z))$
$= \{w \mid [w]_{R_1} \subseteq I(2Y \vee 2Z)\}$
$= \{w \mid [w]_{R_1} \subseteq \{w_1, w_2, w_4, w_6\}\}$
$= \{w_1, w_2\}$.
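
The card example can be replayed in a few lines of Python. The sketch below
rests on assumptions introduced here: worlds are encoded as tuples (card of
player 1, card of player 2), players are indexed 0 and 1 instead of 1 and 2,
and the helper names `partition_by`, `incidence` and `knows` are
hypothetical.

```python
from itertools import permutations

# Worlds are (card of player 1, card of player 2); sorting yields the order
# (X,Y), (X,Z), (Y,X), (Y,Z), (Z,X), (Z,Y), i.e. w1, ..., w6.
WORLDS = sorted(permutations("XYZ", 2))


def partition_by(player):
    """Group worlds by the card the given player (index 0 or 1) holds."""
    blocks = {}
    for w in WORLDS:
        blocks.setdefault(w[player], set()).add(w)
    return list(blocks.values())


def incidence(player, card):
    """Worlds in which the player holds the card, e.g. incidence(1, 'Y') is I(2Y)."""
    return {w for w in WORLDS if w[player] == card}


def knows(player, subset):
    """K_player(subset): lower approximation w.r.t. the player's partition."""
    return {w for block in partition_by(player) if block <= subset for w in block}


two_y_or_two_z = incidence(1, "Y") | incidence(1, "Z")
print(sorted(knows(0, two_y_or_two_z)))  # [('X', 'Y'), ('X', 'Z')]
```

The result consists of the worlds $w_1$ and $w_2$, in agreement with the
derivation above.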



4 Incorporating Probability 

The discussion here draws from that given by Halpern [3]. The language is
extended to allow formulas of the form $P_i(\varphi) \geq \alpha$,
$P_i(\varphi) \leq \alpha$, and $P_i(\varphi) = \alpha$, where $\varphi$ is a
formula and $\alpha$ is a real number in the interval [0, 1]. A formula such
as $P_i(\varphi) \geq \alpha$ can be read “the probability of $\varphi$,
according to player $i$, is at least $\alpha$”.

To give semantics to such formulas, we augment the Kripke structure with a
probability distribution. Assuming there is only one agent, a simple
probability structure $M$ is a tuple $(W, p, \pi)$, where $p$ is a discrete
probability distribution on $W$. The distribution $p$ maps worlds in $W$ to
real numbers in [0, 1] such that $\sum_{w \in W} p(w) = 1.0$. We extend $p$
to subsets $A$ of $W$ by $p(A) = \sum_{w \in A} p(w)$. We can now define
satisfiability in simple probability structures: the only interesting case
comes in dealing with formulas such as $P_i(\varphi) \geq \alpha$. Such a
formula is true if:

$(M, w) \models P(\varphi) \geq \alpha$ if $p(\{w' \mid (M, w') \models \varphi\}) \geq \alpha$.

That is, if the set of worlds where $\varphi$ is true has probability at
least $\alpha$. The treatment of $P_i(\varphi) \leq \alpha$ and
$P_i(\varphi) = \alpha$ is analogous.
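
As a small worked instance of this satisfiability condition, the Python
sketch below evaluates a formula of the form $P(\varphi) \geq \alpha$ in a
simple probability structure. The three-world distribution and the function
names are hypothetical; exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction


def probability(p, event):
    """p(A): the sum of p(w) over the worlds w in A."""
    return sum((p[w] for w in event), Fraction(0))


def satisfies_at_least(p, truth_set, alpha):
    """Check P(phi) >= alpha, where truth_set is the set of worlds satisfying phi."""
    return probability(p, truth_set) >= alpha


# Hypothetical distribution over three worlds; phi is assumed true at w1 and w3.
p = {"w1": Fraction(1, 2), "w2": Fraction(1, 4), "w3": Fraction(1, 4)}
phi_worlds = {"w1", "w3"}

print(probability(p, phi_worlds))                          # 3/4
print(satisfies_at_least(p, phi_worlds, Fraction(2, 3)))   # True
print(satisfies_at_least(p, phi_worlds, Fraction(4, 5)))   # False
```

Because the distribution does not depend on the world, the answer is the
same at every world of the structure.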
Simple probability structures implicitly assume that an agent's (player's)
probability distribution is independent of the state (world). We can
generalize simple probability structures to probabilistic Kripke structures
by having $p$ depend on the world and allowing different agents to have
different probability distributions.

A probabilistic Kripke structure $M$ is a tuple $(W, p_1, \ldots, p_n, \pi)$,
where for each agent $i$ and world $w$, we take $p_i(w)$ to be a discrete
probability distribution, denoted $p_{i,w}$, over $W$. To evaluate the truth
of a statement such as $P_i(\varphi) \geq \alpha$ at world $w$ we use the
distribution $p_{i,w}$:

$(M, w) \models P_i(\varphi) \geq \alpha$ if $p_{i,w}(\{w' \mid (M, w') \models \varphi\}) \geq \alpha$.

We now combine reasoning about knowledge with reasoning about probability. A
Kripke structure for knowledge and probability is a tuple
$(W, \mathcal{K}_1, \ldots, \mathcal{K}_n, p_1, \ldots, p_n, \pi)$. This
structure can give semantics to a language with both knowledge and
probability operators. A natural assumption in this case is that, in world
$w$, agent $i$ only assigns probability to those worlds $\mathcal{K}_i(w)$
that he considers possible. (However, in some cases this may not be
appropriate [3].)

We use the following example from [3] to illustrate a logical approach to 
reasoning about uncertainty. Alice has two coins, one of which is fair while the 
other is biased. The fair coin has equal likelihood of landing heads and tails,
while the biased coin is twice as likely to land heads as to land tails. Alice chooses 
one of the coins (assume she can tell them apart by their weight and feel) and 
is about to toss it. Bob is not given any indication as to which coin Alice chose. 

There are four possible worlds: 

$W = \{w_1 = (F, H), w_2 = (F, T), w_3 = (B, H), w_4 = (B, T)\}$.

The world $w_1 = (F, H)$ says that the fair coin is chosen and it lands
heads. We can easily construct two partitions $\pi_{Alice}$ and $\pi_{Bob}$
of $W$, which represent the respective knowledge of Alice and Bob:

$\pi_{Alice} = \{[w_1, w_2], [w_3, w_4]\}$,

$\pi_{Bob} = \{[w_1, w_2, w_3, w_4]\}$.

The corresponding equivalence relations $R_{Alice}$ and $R_{Bob}$ can be
directly inferred from $\pi_{Alice}$ and $\pi_{Bob}$. In this example, we
consider the following four propositions: $f$ - Alice chooses the fair coin;
$b$ - Alice chooses the biased coin; $h$ - the coin will land heads; $t$ -
the coin will land tails.

We first define a probability distribution $p_{Alice}$, according to Alice,
for each of the worlds $w \in W$. In world $w_1 = (F, H)$,
$p_{Alice,w_1}(w_1) = 1/2$, $p_{Alice,w_1}(w_2) = 1/2$,
$p_{Alice,w_1}(w_3) = 0.0$, $p_{Alice,w_1}(w_4) = 0.0$. For world
$w_3 = (B, H)$, $p_{Alice,w_3}(w_1) = 0.0$, $p_{Alice,w_3}(w_2) = 0.0$,
$p_{Alice,w_3}(w_3) = 2/3$, $p_{Alice,w_3}(w_4) = 1/3$. These definitions
are illustrated in Figure 1.

It can be verified that $p_{Alice,w_1} = p_{Alice,w_2}$ and
$p_{Alice,w_3} = p_{Alice,w_4}$. Moreover, Bob's probability distributions
are the same as Alice's, namely,

$p_{Bob,w_i} = p_{Alice,w_i}$, $i = 1, 2, 3, 4$.
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Fig. 1. A knowledge system for Alice. 



The truth evaluation function $\pi$ maps $\pi(h, w_1) = true$,
$\pi(h, w_2) = false$, $\pi(h, w_3) = true$, $\pi(h, w_4) = false$, since the
coin lands heads exactly in $w_1 = (F, H)$ and $w_3 = (B, H)$. Thus,
$I(h) = \{w_1, w_3\}$.

It can now be shown that

$(M, w_1) \models P_{Alice}(h) = 1/2$,

since $p_{Alice,w_1}(\{w_1, w_3\}) = 1/2$. Similarly,

$(M, w_2) \models P_{Alice}(h) = 1/2$.

This means that Alice knows the probability of heads is 1/2 in world $w_1$:

$(M, w_1) \models \Box_{Alice}(P_{Alice}(h) = 1/2)$,

since $(M, w_1) \models P_{Alice}(h) = 1/2$,
$(M, w_2) \models P_{Alice}(h) = 1/2$, and $[w_1, w_2]$ is an equivalence
class in $R_{Alice}$.

The same is not true for Bob. Note that

$(M, w_1) \models P_{Bob}(h) = 1/2,$

since $p_{Bob,w_1}(\{w_1, w_3\}) = 1/2$. However,

$(M, w_3) \not\models P_{Bob}(h) = 1/2,$

since $p_{Bob,w_3}(\{w_1, w_3\}) = 2/3$. Therefore,

$(M, w_1) \not\models \Box_{Bob}(P_{Bob}(h) = 1/2),$

since for instance $(w_1, w_3) \in R_{Bob}$. This says that Bob does not know that the
probability of heads is 1/2 in world $w_1$.
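
The example can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the names `WORLDS`, `PARTITION`, `P_AT`, `incidence`, `prob_holds`, `half` and `knows` are introduced here and do not come from the paper, and $I(h)$ is taken to be the set of worlds in which the coin lands heads.

```python
from fractions import Fraction

# Worlds as (coin, outcome): F = fair, B = biased, H = heads, T = tails.
WORLDS = {"w1": ("F", "H"), "w2": ("F", "T"), "w3": ("B", "H"), "w4": ("B", "T")}

# Partitions encoding which worlds each agent cannot tell apart.
PARTITION = {
    "Alice": [{"w1", "w2"}, {"w3", "w4"}],
    "Bob": [{"w1", "w2", "w3", "w4"}],
}

FAIR = {"w1": Fraction(1, 2), "w2": Fraction(1, 2), "w3": Fraction(0), "w4": Fraction(0)}
BIASED = {"w1": Fraction(0), "w2": Fraction(0), "w3": Fraction(2, 3), "w4": Fraction(1, 3)}
# P_AT[w] is the distribution attached to world w (the same for Alice and Bob).
P_AT = {"w1": FAIR, "w2": FAIR, "w3": BIASED, "w4": BIASED}


def incidence(prop):
    """I(prop): the set of worlds in which a primitive proposition holds."""
    return {w for w, desc in WORLDS.items() if prop(desc)}


def prob_holds(world, event, value):
    """(M, world) |= P(event) = value."""
    return sum(P_AT[world][w] for w in event) == value


def knows(agent, world, fact):
    """(M, world) |= Box_agent(fact): fact holds throughout world's class."""
    cell = next(c for c in PARTITION[agent] if world in c)
    return all(fact(w) for w in cell)


heads = incidence(lambda desc: desc[1] == "H")


def half(world):
    """The formula P(h) = 1/2 evaluated at a world."""
    return prob_holds(world, heads, Fraction(1, 2))


alice_knows = knows("Alice", "w1", half)
bob_knows = knows("Bob", "w1", half)
```

Under these assumptions `alice_knows` is true and `bob_knows` is false, because Bob's single equivalence class also contains the biased-coin worlds.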

5 Rough Sets for Uncertainty Reasoning 

Recall that each proposition is represented by a set of possible worlds. The
proposition $P_{Alice}(h) = 1/2$, for instance, is represented by

$I(P_{Alice}(h) = 1/2) = \{w_1, w_2\}.$

Similarly, proposition $P_{Bob}(h) = 1/2$ is represented by

$I(P_{Bob}(h) = 1/2) = \{w_1, w_2\}.$








Recall the following results obtained in a previous section using a logical
framework:

$(M, w_1) \models \Box_{Alice}(P_{Alice}(h) = 1/2),$

$(M, w_2) \models \Box_{Alice}(P_{Alice}(h) = 1/2).$

This knowledge can be expressed using the following proposition in rough sets:

$K_{Alice}(P_{Alice}(h) = 1/2).$

By definition, this proposition is represented by the following worlds:

$I(K_{Alice}(P_{Alice}(h) = 1/2)) = \underline{Alice}(\{w_1, w_2\})$

$= \{w \mid [w]_{Alice} \subseteq \{w_1, w_2\}\}$

$= \{w_1, w_2\}.$ (1)

This result is consistent with our earlier result that:

$I(\Box_{Alice}(P_{Alice}(h) = 1/2)) = \{w_1, w_2\}.$

Even though Bob uses the same probability distributions, he is still uncertain
as to when the fair coin is used:

$(M, w_1) \not\models \Box_{Bob}(P_{Bob}(h) = 1/2).$

The same knowledge (or lack thereof) can be expressed using rough sets as:

$A = I(P_{Bob}(h) = 1/2) = \{w_1, w_2\}.$

However,

$\underline{A} = K(A) = \{w \mid [w]_{Bob} \subseteq A\} = \emptyset,$

since

$[w_1]_{Bob} = \{w_1, w_2, w_3, w_4\} = [w_2]_{Bob} = [w_3]_{Bob} = [w_4]_{Bob}.$

Finally, let us determine when Alice knows that the coin is fair and also
knows that the probability of heads is 1/2. This sentence is represented in rough
sets as:

$K_{Alice}(f) \wedge K_{Alice}(P_{Alice}(h) = 1/2).$

Now

$I(K_{Alice}(f)) = \{w_1, w_2\}.$

By Equation (1),

$I(K_{Alice}(P_{Alice}(h) = 1/2)) = \{w_1, w_2\}.$

By the definition of the incidence mapping:

$I(K_{Alice}(f) \wedge K_{Alice}(P_{Alice}(h) = 1/2))$

$= I(K_{Alice}(f)) \cap I(K_{Alice}(P_{Alice}(h) = 1/2))$

$= \{w_1, w_2\}.$
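
The rough-set reading of the knowledge operator used in this section is a lower approximation over an agent's partition. The sketch below illustrates that reading on the coin example; the function name `lower` and the variable names are assumptions made for this illustration.

```python
def lower(partition, target):
    """Lower approximation: union of the blocks wholly contained in target."""
    result = set()
    for block in partition:
        if block <= target:
            result |= block
    return result


alice = [{"w1", "w2"}, {"w3", "w4"}]
bob = [{"w1", "w2", "w3", "w4"}]

prob_half = {"w1", "w2"}  # I(P(h) = 1/2), the same set for both agents
fair = {"w1", "w2"}       # I(f)

k_alice = lower(alice, prob_half)    # worlds where Alice knows P(h) = 1/2
k_bob = lower(bob, prob_half)        # worlds where Bob knows P(h) = 1/2
both = lower(alice, fair) & k_alice  # conjunction maps to set intersection
```

Here `k_bob` is empty because Bob's only block is the whole universe, which is not contained in `prob_half`.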
6 Conclusion

Rough sets have primarily been applied to classification problems. Recently, it
has been shown that rough sets can also be applied to reasoning about
knowledge [6, 8]. In this preliminary paper, we have added probability. This allows us
to represent formulas such as "the probability of $\varphi$, according to player 1, is at
least $\alpha$", where $\varphi$ is a formula and $\alpha$ is a real number in [0, 1]. Thus, the only
extension to the work in [8] is to allow formulas involving probability.

On the other hand, our original objective was to introduce a probability
operator P in the same spirit as the knowledge operator K in [8]. Unfortunately,
while P behaves nicely with K, P does not always interact nicely with itself. We
are currently working to resolve these problems.
Abstract. Feature subset selection is of prime importance in pattern
classification, machine learning and data mining applications. Though
statistical techniques are well developed and mathematically sound, they
are inappropriate for dealing with real-world cognitive problems containing
imprecise and ambiguous information. Soft computing tools such as artificial
neural networks, genetic algorithms, fuzzy logic and rough set theory, and
their integration into hybrid algorithms for handling real-life
problems, have recently been found to be the most effective. In this work a
neuro-rough hybrid algorithm is proposed in which rough set concepts
are used for finding an initial subset of efficient features, followed by a
neural stage to find the ultimate best feature subset. The reduction
of the original feature set results in a smaller structure and quicker learning
in the neural stage, and as a whole the hybrid algorithm seems to provide
better performance than an algorithm from either individual paradigm, as is
evident from the simulation results.



1 Introduction 

Selection of a good subset of available features is not only the prime concern of
pattern classification problems but also plays an important role in the fields of
machine learning, knowledge discovery and data mining. Irrelevant and redundant
features generally degrade the performance of almost all common machine
learning or pattern classification algorithms. A good choice of a useful feature
subset from a vast set of features helps in devising a compact and efficient
learning algorithm for pattern classification or machine learning, and also results
in better understanding and interpretation of data in knowledge discovery and
data mining problems.

The problem of feature selection has long received immense attention from the
statistical community. Statisticians have made significant contributions [1]
to the field of pattern recognition, ranging from techniques that find the
optimal feature subset to suboptimal or near-optimal solutions. Most of the
statistical approaches are based on some assumption about the probability distribution
of the data set which, in practice, rarely follows the ideal one. Presently artificial
neural networks (ANN) are becoming popular for analysis of vast data sets [2],
[3]. The neural approach is especially efficient when the only source of available
information is the training data. Neural networks are known to be capable of
extracting information from raw data and to generalize well.
Generally ANNs consider a fixed topology of layers of neurons interconnected
by links in a predefined manner. Connection weights are usually initialized with
small random values. The main drawbacks of a neural network learning system are
that it is time consuming, especially when the number of inputs is large, and that,
though it is effective in the presence of noise, the proper choice of network
architecture still remains an unsolved problem. Recently fuzzy set theory and rough
set theory [4] have been widely used as tools for knowledge extraction from large
databases with imprecise and uncertain information. These soft computing tools have
been shown to provide adaptivity and fault tolerance in modern intelligent systems.
Hybrid systems [5], [6] have been developed by integrating the merits of different
paradigms to handle real-life problems more efficiently.

Motivated by the improved performance obtained with hybridization, in this work a
neuro-rough hybrid algorithm is proposed to solve the problem of feature
subset selection. Rough set theory is used in the first stage to get
a rough idea of the useful features and their subsets from the raw data. In the
second stage a neural network is used with the reduced feature set for
finding the ultimate best subset of features. The reduction of the original
feature set results in a smaller structure and quicker learning in the neural stage,
and as a whole the hybrid algorithm seems to provide better performance than
an algorithm from either individual paradigm. The algorithm has been simulated on
two different data sets, and it has been found that the present algorithm
considerably reduces the time required for finding the best feature subset compared
to our previous work reported in [7], where only a neural network was used
for solving the problem. The next section gives a brief introduction to rough set
preliminaries and their use in the feature subset selection problem.



2 Rough Set Theory and Feature Subset Selection

This section describes the basic concepts of rough set theory and how it can be
used to obtain an initial idea of a useful feature subset. A detailed treatment of
rough set theory can be found in [8].



2.1 Rough Set Preliminaries

According to rough set theory, an information system is a four-tuple
$S = (U, Q, V, f)$ where

$U$, a non-empty finite set, represents the universe of objects,

$Q$, a non-empty finite set, represents the set of attributes or features,

$V$, a non-empty finite set, represents the set of possible attribute or feature
values, and $f$ is the information function which, given an object and a feature,
maps it to a value, i.e. $f : U \times Q \to V$.

An information system is represented by an attribute- value table in which 
rows are labeled by objects of the universe and columns by the attributes. 

An indiscernibility relation is an equivalence relation with respect to a set of
attributes (features) which partitions the universe of objects into a number of
classes in such a manner that the members of the same class are indiscernible while
the members of different classes are distinguishable with respect to the particular
set of attributes.

Let $P$ be a subset of $Q$, that is, $P$ is a subset of features. The P-indiscernibility
relation, denoted by $IND(P)$, is defined as

$IND(P) = \{(x, y) \in U \times U : \text{for every feature } a \in P,\ f(x, a) = f(y, a)\}$

Then $IND(P) = \bigcap_{a \in P} IND(a)$.

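For any concept (class or label) $X$ where $X \subseteq U$ and for any subset of
features $P$, $P \subseteq Q$, the P-lower ($\underline{P}$) and the P-upper approximation
($\overline{P}$) of $X$ are defined as follows:

$\underline{P}(X) = \bigcup \{Y \in U/IND(P) : Y \subseteq X\}$

$\overline{P}(X) = \bigcup \{Y \in U/IND(P) : Y \cap X \neq \emptyset\}$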

The boundary region for the concept $X$ with respect to the subset of features
$P$ is defined as:

$BND_P(X) = \overline{P}(X) - \underline{P}(X).$

$POS_P(X) = \underline{P}(X)$ and $NEG_P(X) = U - \overline{P}(X)$ are known as the P-positive
region of $X$ and the P-negative region of $X$.

If $BND_P(X) = \emptyset$ then $X$ is definable or classifiable using $P$. Otherwise the
class $X$ is a rough set with respect to the feature subset $P$.
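
The definitions above translate directly into set operations. The following Python sketch is an illustration only: the attribute-value table `TABLE`, the concept `X` and the function names are hypothetical and chosen to keep the example small.

```python
from collections import defaultdict


def ind_classes(table, features):
    """U/IND(P): group objects whose values agree on every feature in P."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in features)].add(obj)
    return list(groups.values())


def lower_upper(table, features, concept):
    """Return the P-lower and P-upper approximations of a concept."""
    lower, upper = set(), set()
    for block in ind_classes(table, features):
        if block <= concept:   # block lies entirely inside the concept
            lower |= block
        if block & concept:    # block overlaps the concept
            upper |= block
    return lower, upper


# Hypothetical table: object -> {feature: discretized value}.
TABLE = {
    "x1": {"a": 0, "b": 1},
    "x2": {"a": 0, "b": 1},
    "x3": {"a": 1, "b": 0},
    "x4": {"a": 1, "b": 1},
}
X = {"x1", "x3"}  # a concept X, a subset of U

low, up = lower_upper(TABLE, ["a", "b"], X)
boundary = up - low          # BND_P(X)
negative = set(TABLE) - up   # NEG_P(X)
```

Because `x1` and `x2` are indiscernible but only `x1` belongs to `X`, the boundary is non-empty and `X` is a rough set with respect to the two features.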

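Let $F = \{X_1, X_2, \ldots, X_n\}$, $X_i \subseteq U$, be a classification of $U$ and let
$P \subseteq Q$; then $\underline{P}F = \{\underline{P}X_1, \underline{P}X_2, \ldots, \underline{P}X_n\}$ and
$\overline{P}F = \{\overline{P}X_1, \overline{P}X_2, \ldots, \overline{P}X_n\}$ denote
the P-lower and P-upper approximations of the classification $F$ (family of classes).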

Inexactness of a rough set is due to the existence of boundary region. The 
greater the boundary region, the lower the accuracy of the set. Measures for 
accuracy or approximation of a rough set are defined below. 



Approximation Measures. Two measures to describe the inexactness of approximate
classifications have been defined in rough set theory as follows:

The accuracy of approximation expresses the possible correct decisions when
classifying objects using attribute subset $P$ and is defined as

$\alpha_P(F) = \frac{\sum_{i=1}^{n} card(\underline{P}X_i)}{\sum_{i=1}^{n} card(\overline{P}X_i)}$

The quality of approximation expresses the percentage of objects which can
be correctly classified employing the attribute subset $P$ and is defined as

$\gamma_P(F) = \frac{\sum_{i=1}^{n} card(\underline{P}X_i)}{card(U)}$
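
As a worked illustration of the two measures, the sketch below computes them for a small hypothetical table. The data, the helper `blocks` and the function name `approximation_measures` are assumptions made for the example.

```python
from collections import defaultdict


def blocks(table, features):
    """Equivalence classes of IND(P) for the given feature subset."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in features)].add(obj)
    return list(groups.values())


def approximation_measures(table, features, classes):
    """Return (accuracy, quality) of a classification under a feature subset."""
    parts = blocks(table, features)
    lower_total = upper_total = 0
    for concept in classes:
        lower_total += sum(len(b) for b in parts if b <= concept)
        upper_total += sum(len(b) for b in parts if b & concept)
    return lower_total / upper_total, lower_total / len(table)


# Hypothetical table and classification F = {X1, X2}.
TABLE = {
    "x1": {"a": 0, "b": 1},
    "x2": {"a": 0, "b": 1},
    "x3": {"a": 1, "b": 0},
    "x4": {"a": 1, "b": 1},
}
F = [{"x1", "x3"}, {"x2", "x4"}]

accuracy, quality = approximation_measures(TABLE, ["a", "b"], F)
```

For this table each class has a lower approximation of one object and an upper approximation of three, so the accuracy is 2/6 and the quality is 2/4.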



Reduct and Core of Attributes. These two are fundamental concepts in
rough set theory in connection with knowledge reduction. A reduct denotes
the essential part of the knowledge while the core is its most important part. A
reduct of a feature set is a minimal subset which provides the same classification
of objects as the whole set.

An attribute or feature $a \in P$ is superfluous or redundant in $P$ if $IND(P) =
IND(P - \{a\})$; otherwise the attribute $a$ is indispensable in $P$. $P$ is an
independent set of features if there does not exist a strict subset $P'$ of $P$ such that
$IND(P) = IND(P')$.

A subset $R$ ($R \subseteq P$) is a reduct of $P$ if it is independent and $IND(R) =
IND(P)$. Each reduct has the property that a feature cannot be removed from
it without changing the indiscernibility relation. Many reducts for a given set of
features $P$ may exist.

The set of features belonging to the intersection of all reducts of $P$ is called the
core of $P$ (P-core). In fact, the P-core is the set of all indispensable features in $P$.
Thus $core(P) = \bigcap_{R \in reduct(P)} R$.

Indispensable features, reducts and the core can be defined similarly relative
to the output class or label; these are known as relative reducts and the relative
core. In this paper we use the terms reduct and core to mean relative reduct and
relative core with respect to the output labels or classes.
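
For small feature sets, reducts and the core can be found by exhaustive search. The sketch below is a brute-force illustration of the non-relative definitions given above (it compares indiscernibility partitions, not class labels); the table and the function names are hypothetical, and the search is exponential in the number of features.

```python
from itertools import combinations


def partition(table, features):
    """The partition U/IND(P) as a set of frozensets."""
    groups = {}
    for obj, row in table.items():
        groups.setdefault(tuple(row[a] for a in features), set()).add(obj)
    return {frozenset(g) for g in groups.values()}


def reducts(table, features):
    """All minimal subsets R of features with IND(R) = IND(features)."""
    full = partition(table, features)
    found = []
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # contains a smaller reduct, so it is not minimal
            if partition(table, subset) == full:
                found.append(subset)
    return found


def core(table, features):
    """Intersection of all reducts."""
    rs = reducts(table, features)
    return set.intersection(*(set(r) for r in rs)) if rs else set()


# Hypothetical table in which features b and c carry the same information.
TABLE = {
    "x1": {"a": 0, "b": 0, "c": 0},
    "x2": {"a": 0, "b": 1, "c": 1},
    "x3": {"a": 1, "b": 0, "c": 0},
    "x4": {"a": 1, "b": 1, "c": 1},
}
all_reducts = reducts(TABLE, ["a", "b", "c"])
the_core = core(TABLE, ["a", "b", "c"])
```

In this table either `b` or `c` can be dropped but `a` cannot, so there are two reducts and the core contains only `a`.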



2.2 Feature Subset Selection with Rough Set 

The first stage of the two-stage neuro-rough hybrid feature subset
selection algorithm proposed in this paper follows the concepts of rough set
theory and consists of the following steps.

1. The multidimensional data (each dimension representing an individual attribute
or feature) of known classification, i.e. the training data, is expressed as an
attribute-value table with rows as the objects or instances and columns as
the attribute values and the corresponding class label. As continuous feature
values are difficult to handle by the proposed algorithm, the feature values
are discretized to some predefined levels. Let the number of attributes be
$|Q|$, the number of classes be $n$ and the number of instances be $N$. The
objective is to select the subset of attributes $P$ in such
a way that it minimizes $BND_P(X_i)$ for all $i$ corresponding to the classification.

2. Now the indiscernibility relation induced by any feature ($a \in Q$) or a subset
of features $P$ from the feature set $Q$ and the corresponding classification
(partition) is examined. The feature subset $P$ for which the accuracy
measures $\alpha_P(F)$ and $\gamma_P(F)$ of the classification $F = \{X_1, X_2, \ldots, X_n\}$,
$X_i \subseteq U$, are highest is selected as an approximation for the possible good
feature subset.

3. In the next step it is examined whether the selected feature subset $P$ has
any relative reducts $P_1, P_2, \ldots$, relative to the classification, considering the
attribute-value table or the discernibility matrix.

4. The reduct of $P$ obtained in the previous step (if multiple reducts exist,
the reduct containing the fewest features) is passed to the next
stage as input to the neural network for finding the ultimate
best feature subset.

3 Neural Network for Best Feature Subset Selection 

A fractal neural network model, a modified version of the feedforward multilayer
perceptron with a statistically fractal connection structure, used earlier and
reported in [7], is used here. The model and the feature subset
selection algorithm based on it are presented briefly in the next subsections.



3.1 Fractal Neural Network Model

The fractal neural network model is a modified version of the feedforward multilayer
neural network in which upper layer neurons are connected to the lower layer
neurons with a probability following an inverse power law, which generates a
sparse network with a statistically fractal connection structure. However, the final
hidden layer is fully connected to the output layer. Each layer is an array of
neurons in one or two dimensions, depending on the type of input to be processed.
The probability that the $i$th processing element in the $k$th layer receives a connection
from the $j$th processing element of the previous layer, denoted by $CP_{ijk}$, follows
the law



$i = 1, 2, \ldots, n_k; \quad j = 1, 2, \ldots, n_{k-1}; \quad 0 \le D_k \le d$ (1)

where $r_{ijk}$ is the Euclidean distance between the $i$th processing element in the
$k$th layer (considering one-dimensional layers) and the $j$th processing element of the
previous layer, defined as

$r_{ijk} = \| Q_{ik} - Q_{j(k-1)} \|, \quad r_{ijk} \ge 1$ (2)

$d$ denotes the dimension of the array of neurons in the $k$th layer, $A$ represents a
constant, and $D_k$ represents the fractal dimension (similarity dimension) of the
synaptic connection distribution of the $k$th layer. $Q_{ik}$ and $Q_{j(k-1)}$ denote the
spatial positions of the $i$th processing element in the $k$th layer and the $j$th processing
element of the previous layer, defined by

$Q_{ik} = [\lfloor n_{k-1}(2i - 1)/2n_k \rfloor, k]$ for $i = 1, 2, \ldots, n_k$ (3)

where $n_{k-1}$ and $n_k$ represent the number of neurons in the $(k-1)$th and $k$th
layers respectively.

To implement such a sparse neural network, for each $k$, a uniform random
number $\rho$ on the interval [0,1] has to be generated and the connectivity $C_{ijk}$ of
the link from the $i$th processing element in the $k$th layer to the $j$th processing
element of the previous layer is to be assigned as

$C_{ijk} = 1$, if $CP_{ijk} > \rho$ (4)

$C_{ijk} = 0$, otherwise

The operation of the network is similar to the operation of any multilayer 
feedforward backpropagation network. The connection structure of the network 
allows low probability of long range connection links and high probability of 
short range connection links. 
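
The sampling procedure can be sketched in Python as follows. This is an illustration under explicit assumptions, not the authors' implementation: the power-law form `A * r ** -(d - D)` is an assumed reading of the inverse power law, a fresh random number is drawn for every candidate link, neuron positions are passed in directly, and adjacent layers are taken to be one unit apart so that every distance is at least 1.

```python
import math
import random


def connection_probability(r, A=1.0, d=1, D=0.9):
    """ASSUMED inverse power law: CP = A * r**-(d - D), capped at 1 (r >= 1)."""
    return min(1.0, A * r ** (-(d - D)))


def connectivity(upper_xs, lower_xs, A=1.0, d=1, D=0.9, seed=0):
    """Return C[i][j] in {0, 1} for links between two one-dimensional layers.

    upper_xs and lower_xs hold the horizontal positions of the neurons.
    """
    rng = random.Random(seed)
    C = []
    for xi in upper_xs:
        row = []
        for xj in lower_xs:
            r = math.hypot(xi - xj, 1.0)  # Euclidean distance between neurons
            rho = rng.random()            # uniform random number in [0, 1)
            row.append(1 if connection_probability(r, A, d, D) > rho else 0)
        C.append(row)
    return C


# With D equal to d the exponent vanishes and the layers are fully connected.
dense = connectivity([0.5, 1.5], [0.0, 1.0, 2.0], D=1.0, d=1)
# A smaller fractal dimension makes long-range links increasingly unlikely.
sparse = connectivity([0.5, 1.5], [0.0, 1.0, 2.0], D=0.2, d=1)
```

Lowering `D` relative to `d` steepens the decay, which matches the qualitative behaviour described above: short links are likely and long links are rare.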



3.2 Feature Subset Selection Algorithm 

A simple algorithm for selecting the best feature subset with the proposed fractal
neural network is presented below. The network is trained for optimum
efficiency, determined by the highest classification rate for the problem at hand,
by suitable setup of the different parameters, using the feature subset selected
in the first stage as the rough set reduct. The features are then removed one by one;
the feature to remove is chosen by examining the change in classification rate.
Depending on the problem and the required classification accuracy, the final
subset of features has to be determined. The actual steps of the algorithm are
as follows.

1. For a selected feature subset of $P$ features, a fractal neural network with an
input layer of $P$ neurons and an output layer of $n$ neurons (for an $n$-class
problem) is set up. The number of hidden layers, the number of neurons in
each hidden layer and the fractal dimension of the synaptic connections are
chosen heuristically by trial. The connection structure is set up according to
Eq. 1 and Eq. 4 with a proper selection of the values of $d$ and $A$.

2. The network is trained several times for selecting the optimum number of 
hidden layers, the number of neurons in each hidden layer and the optimum 
value of the fractal dimension for which the classification rate for the test 
samples is the highest. This optimum net configuration is retained for later 
steps. 

3. The fractal network set up in the previous step is retrained with the subset of 
input features one less from the initial subset of P features. The classification 
rate with the one less input is calculated with the same test samples. 

4. All the inputs are removed one by one and the whole procedure of the pre- 
vious step is repeated. The inputs are ranked according to the classification 
rate of the network without that particular input. The highest classification 
rate obtained, corresponds to the most irrelevant input. 

5. After removal of the most irrelevant input selected in the previous step, steps
3 and 4 are repeated for removal of the next irrelevant feature.

6. The process is stopped when any one of the following stopping criteria is
met.

(a) The total number of features attains a pre-assigned limit. 

(b) The classification score falls below a preassigned limit. 
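
The elimination loop in steps 3 to 6 can be sketched independently of the network itself. In the Python illustration below, `score` is a placeholder for "retrain the fractal network on this subset and return the classification rate on the test samples"; the function names, parameter names and the toy scoring function are all assumptions made for the example.

```python
def backward_eliminate(features, score, min_features=1, min_score=0.0):
    """Greedy backward elimination driven by a user-supplied score function."""
    current = list(features)
    history = [(tuple(current), score(current))]
    while len(current) > min_features:        # stopping criterion (a)
        # Steps 3-4: drop each input in turn and record the resulting rate.
        trials = [(score([f for f in current if f != g]), g) for g in current]
        best_rate, least_relevant = max(trials, key=lambda t: t[0])
        if best_rate < min_score:             # stopping criterion (b)
            break
        current.remove(least_relevant)        # step 5
        history.append((tuple(current), best_rate))
    return current, history


# Toy stand-in for the neural stage: only "a" and "c" carry information,
# and every extra input costs a little accuracy.
USEFUL = {"a", "c"}


def toy_score(subset):
    return len(set(subset) & USEFUL) / len(USEFUL) - 0.01 * len(subset)


selected, trace = backward_eliminate(["a", "b", "c", "d", "e"], toy_score,
                                     min_score=0.9)
```

With the toy score the loop discards `b`, `d` and `e` in turn and stops when removing either remaining feature would push the rate below the threshold.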

4 Simulation and Results 

The proposed algorithm has been simulated on two data sets: the Sonar data set,
used for underwater target recognition [9], and the Iris data set [10], commonly used
to test pattern recognition algorithms.



4.1 Simulation with IRIS data 



This data set contains three classes, each with 50 sample vectors. Each sample has
four features ($F_1$, $F_2$, $F_3$ and $F_4$). As the number of features in this case is
small, the feature set has been extended: twelve features have been generated
from the primary four features, and altogether sixteen features are considered
as the feature set for our experiment. The features, in increasing order of
feature number, are ($F_1$, $F_2$, $F_3$, $F_4$, $F_1 - F_2$, $F_1 - F_3$, $F_1 - F_4$,
$F_2 - F_3$, $F_2 - F_4$, $F_3 - F_4$, $F_1/F_2$, $F_1/F_3$, $F_1/F_4$, $F_2/F_3$, $F_2/F_4$, $F_3/F_4$).
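
The extended feature vector can be generated mechanically. The Python sketch below assumes the sixteen features are the four originals followed by the six pairwise differences and the six pairwise ratios, in the order listed above; the function name `extend_features` and the sample values are hypothetical.

```python
from itertools import combinations


def extend_features(sample):
    """Map (F1, F2, F3, F4) to originals + pairwise differences + ratios."""
    base = list(sample)
    pairs = list(combinations(range(len(base)), 2))
    diffs = [base[i] - base[j] for i, j in pairs]
    ratios = [base[i] / base[j] for i, j in pairs]  # assumes non-zero values
    return base + diffs + ratios


# Hypothetical measurement vector, chosen so the arithmetic is exact.
vector = extend_features((4.0, 2.0, 1.0, 0.5))
```

Feature number $m$ in the text corresponds to `vector[m - 1]`.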

Following rough set theoretic concepts, the initial approximation of the best 
feature subset has come out to be the following subset 

(3,4,8,9,11,12,5,6,14) 

In the second stage, a fractal network with 9 neurons in the input layer and
3 neurons in the output layer has been set up. One hidden layer with different
numbers of neurons (2, 4, 6) and different values of fractal dimension ranging from
0.85 to 0.98 have been used in the experiment. The initial weight values are selected
randomly from -0.1 to 0.1. The network with one hidden layer of 4 neurons
and fractal dimension 0.9 has been chosen as the optimum network for feature
selection. The final best feature subset came out to be (3, 4, 8, 9).

Table 1 compares, in terms of time and recognition score,
the present algorithm and the algorithm presented in [7], in which only the fractal
neural network was used as the tool for feature subset selection. The table
shows that the hybrid neuro-rough algorithm performs better than the neural-only
algorithm in terms of time, though the recognition rate and the final selected
feature subset are the same.



| No. of features in subset | Average recognition score, neural algorithm | Average recognition score, neuro-rough algorithm | Time taken, neural algorithm | Time taken, neuro-rough algorithm |
|---|---|---|---|---|
| 9 | 97.2% | 96.8% | 1.38 hrs | 0.25 hrs |
| 4 | 98.2% | 98.2% | 1.72 hrs | 0.58 hrs |

Table 1. Comparison of Neural and Neuro-Rough algorithm for IRIS data set



4.2 Simulation with SONAR data 

This data set is produced by taking 60 sample points per signal (making 60
features) from the power spectral envelope of sonar returns from two types of targets.
These samples were normalized to take on values between 0.0 and 1.0; details
can be found in [9].

For this data set, the initial approximation of the efficient feature subset following
the rough set theoretic algorithm is a set of 15 features. In the
second stage the fractal network with 15 neurons in the input layer and 2 neurons
in the output layer has been used. The connection structure has been set up
according to Eq. 1 and Eq. 4 with the values of $A$ and $d$ taken as 1 as before. The
number of neurons in the hidden layer is varied between 4 and 10 in the experiment.
The value of the fractal dimension has also been varied (from 0.8 to 0.95) to find
the optimum connection structure of the fractal network for this particular
problem. Table 2 gives the time taken and the recognition score
of the present algorithm and the neural-only algorithm for the Sonar data set. The
table shows that the hybrid neuro-rough algorithm also performs better than the
neural-only algorithm on the Sonar data in terms of time, while
the recognition rate and the number of features in the final selected feature
subset are more or less the same.

| No. of features in subset | Average recognition score, neural algorithm | Average recognition score, neuro-rough algorithm | Time taken, neural algorithm | Time taken, neuro-rough algorithm |
|---|---|---|---|---|
| 15 | 90.2% | 93.8% | 7.12 hrs | 0.57 hrs |
| 5 | 98.2% | 98.2% | 7.48 hrs | 1.67 hrs |

Table 2. Comparison of Neural and Neuro-Rough algorithm for SONAR data set

5 Conclusion

Feature subset selection is very important in pattern classification, machine
learning and data mining problems. Most collected real data sets contain
redundant or irrelevant information. While statistical techniques for
best feature subset selection are well known and mathematically strong,
they are computationally unattractive, especially for large real-world data
sets. Artificial neural networks are nowadays becoming popular tools in
pattern classification.

In this work a hybrid two-stage feature subset selection algorithm has been proposed
to lessen the time and computational burden. In the first stage rough set theoretic
concepts are applied to extract information from the raw data set to find
an approximate set of efficient features. In the second stage a fractal neural network
model is used to find the ultimate best feature subset. As the number
of features is reduced in the first stage, the time taken for finding the best
feature subset is less than with the neural-only approach to the feature
subset selection problem. The simulation results also reflect the benefit of
hybridization. Though extensive simulations on different data sets, especially from
real-world applications, are yet to be done, the proposed hybridization
points to a quick algorithm for solving the feature subset selection problem.
Abstract. We propose an active block matching algorithm for motion
estimation. The proposed algorithm dynamically determines the search
area and the matching metric. We exploit the constraint of small velocity
changes of a block over time to determine the origin of the
search area. The range of the search area is adjusted according to the
motion coherency of spatially neighboring blocks. Our matching metric
includes multiple features. The degree of overall match is computed
as the weighted sum of matches of individual features. We adjust the
weights depending on the distinctiveness of features in a block, so that
we may discriminate features according to the characteristics of the
block involved. The experimental results show that the proposed algorithm
can yield very accurate block motion vectors.



1 Introduction 

The technique of block motion estimation is currently favored by many re- 
searchers in the field. The process of block matching is to find a candidate block, 
within a search area in the previous frame, that is most similar to the current 
block in the present frame, according to a predetermined criterion [1,2]. In this 
paper, we propose an active search algorithm for a candidate block. The search
origin for each block is adjusted by the motion vector of the block in the previous
frame to make use of the constraint of small velocity changes of a block over
time. We also adjust the range of the search area according to the motion
coherency of spatially neighboring blocks. A smaller search area is assigned
to a block having more coherent motion in its neighboring blocks.

Most block matching algorithms just consider the difference of color or gray
intensities of corresponding blocks when they compute the degree of match
[3,4,5]. This criterion of match may be acceptable for video coding,
since the primary concern of coding is to reduce the redundancy between
successive frames. However, when an accurate estimation of block motion
vectors is needed, as in video conferencing, it may cause problems. To resolve this,
we involve multiple features in the matching metric. The degree of overall
match is computed as the weighted sum of matches of individual features.
We adjust the weights depending on the distinctiveness of features in a block, so
that we may discriminate features according to the characteristics of the block
involved.
2 Adaptive Setting of Search Area

The blocks to be examined in the previous frame are within a search area whose
origin and size are determined by exploiting the motion vectors of blocks in the
previous frame. We denote a search area as in (1).

$SA = (p(x, y), s(x, y)) = ((x, y), (x_s, y_s))$ (1)

where $p(x, y)$ denotes the origin of a search area and $s(x, y)$ denotes its size.
In most image sequences, motions are smooth and slow- varying. Discontinuity 
of motion vectors only occurs at the boundary of objects moving in different 
directions [3]. Since the moving object often covers several blocks, motion vec- 
tors between adjacent blocks are highly correlated. In our proposed approach, 
we utilize these characteristics. The blocks of the previous frame already have 
motion vectors, since their corresponding blocks in the second previous frame 
have been identified. We presume that the motion vector of a block is likely to be 
similar to the motion vectors of its neighboring blocks. We also presume that the 
motion of a block does not change rapidly along a relatively small time interval. 
We therefore use, as the origin of a search area, the location (i.e., block) in the 
previous frame which points to the current block by its motion vector. 



$p_{B_i^{cur}}(x, y) = p_{B_{j(i)}^{pre}}(x, y) + MV(B_{j(i)}^{pre})$ (2)

In (2), $B_{j(i)}^{pre}$ denotes the block-j in the previous frame whose motion vector,
$MV(B_{j(i)}^{pre})$, points to the current block-i in the present frame, $B_i^{cur}$. This
equation depicts how the search origin of $B_i^{cur}$ is computed.

Typically, only a small number of blocks have a large displacement in most
image sequences. Therefore it is not efficient to fix the search range for each
block. We take advantage of the inter-block motion correlation to adaptively
determine the size of a search area. The size of a search area is allowed to vary
within its maximum range of $s_{max}(x, y)$ and its minimum range of $s_{min}(x, y)$ as
in (3).

$s_{B_i^{cur}}(x, y) = s_{min}(x, y) + (1 - CF(MV(B_{j(i)}^{pre}))) \cdot (s_{max}(x, y) - s_{min}(x, y))$ (3)

In (3), $CF(MV(B_{j(i)}^{pre}))$ is a certainty factor that reflects the reliability of the
motion vector $MV$. It is designed to have a value between 0 and 1, so that
the size of the range is adjusted depending on the reliability of the related motion
vector. This strategy is based on the assumption of slow-varying motion. To
determine the reliability of a motion vector, we utilize the smoothness constraint
of motion. We represent the motion coherency of spatially neighboring blocks in
the form of a certainty factor.

$CF(MV(B_{j(i)}^{pre})) = \frac{K_1}{1 + K_2 \cdot VD(B_{j(i)}^{pre})}$

$VD(B_{j(i)}^{pre}) = \| MV(B_{j(i)}^{pre}) - \mu \|^2 \cdot \sigma^2$

$\mu$ = mean of $MV(B_{j^*}^{pre})$, $j^* \in$ neighborhood of $j(i)$

$\sigma^2$ = variance of $MV(B_{j^*}^{pre})$, $j^* \in$ neighborhood of $j(i)$ (4)
In (4), $VD(B_{j(i)}^{pre})$ denotes the variance-compensated distance of $MV(B_{j(i)}^{pre})$
from the mean of its neighboring motion vectors. $VD(B_{j(i)}^{pre})$ becomes small when
$MV(B_{j(i)}^{pre})$ is close to the mean and the variance is small. In other words, if
the motion coherency of spatially neighboring blocks is high and also the motion
vector under consideration is close to the mean motion vector of the neighboring
blocks, then $VD(B_{j(i)}^{pre})$ becomes small and $CF(MV(B_{j(i)}^{pre}))$ gets large. A large
certainty factor indicates that $MV(B_{j(i)}^{pre})$ is highly reliable and the size of a
search area can be reduced.
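
The interplay between the certainty factor and the search range can be illustrated with a short Python sketch. It assumes a scalar variance (the mean squared distance of the neighboring motion vectors from their mean), constants $K_1 = K_2 = 1$, and a one-dimensional search size; the function names and sample vectors are hypothetical.

```python
def certainty_factor(mv, neighbor_mvs, k1=1.0, k2=1.0):
    """Certainty factor of a motion vector given its neighbors' vectors."""
    n = len(neighbor_mvs)
    mean = (sum(v[0] for v in neighbor_mvs) / n,
            sum(v[1] for v in neighbor_mvs) / n)
    # Scalar variance: mean squared distance of the neighbors from their mean.
    var = sum((v[0] - mean[0]) ** 2 + (v[1] - mean[1]) ** 2
              for v in neighbor_mvs) / n
    # Variance-compensated distance of mv from the neighborhood mean.
    vd = ((mv[0] - mean[0]) ** 2 + (mv[1] - mean[1]) ** 2) * var
    return k1 / (1.0 + k2 * vd)


def search_size(cf, s_min, s_max):
    """Shrink the search range toward s_min as the certainty factor grows."""
    return s_min + (1.0 - cf) * (s_max - s_min)


# Coherent neighborhood: every neighbor moves by (2, 2), and so does the block.
cf_coherent = certainty_factor((2, 2), [(2, 2), (2, 2), (2, 2)])
# Incoherent neighborhood: neighbors disagree and the block is off the mean.
cf_incoherent = certainty_factor((4, 0), [(0, 0), (4, 0)])

size_coherent = search_size(cf_coherent, s_min=4, s_max=16)
size_incoherent = search_size(cf_incoherent, s_min=4, s_max=16)
```

A fully coherent neighborhood gives a certainty factor of 1 and the minimum search size, while disagreement among neighbors drives the size toward the maximum.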



3 Adaptive Setting of Matching Metric 



Given a block of size $N \times N$, block motion estimation looks for the best
matching block within a search area. One can consider various criteria as a
measure of the match between two blocks [6]. We claim that the intensity difference
between two blocks may not provide an accurate estimation of block motion,
since it does not consider the internal structure. We propose involving multiple
features of various types in the matching metric.

When multiple features are used in a matching metric, one has to take into
consideration the following two issues. The first is the issue of normalizing the
scale of features. The second issue is how to properly weigh the features according
to their importance. We normalize each feature by dividing it by the highest
value that it can have, so that the similarity of an individual feature between two
blocks ranges from 0 to 1. For example, at each search point, the displaced block
similarity (DBS) according to the k-th feature, $f_k$, is computed as in (5). In
(5), the index $(n; i, j)$ denotes the block at $(i, j)$ in the present frame, the index
$(n-1; i+x, j+y)$ denotes a candidate block at $(i+x, j+y)$ within a search area
in the previous frame, and the displacement $(x, y)$ denotes the corresponding
disparity between two blocks.

$DBS(f_k; i, j; x, y) = 1 - \frac{1}{N^2} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} \frac{|f_k(n; i+u, j+v) - f_k(n-1; i+x+u, j+y+v)|}{f_k^{max}}$ (5)



The overall displaced block similarity (ODBS) is then formed as the weighted
sum of the similarities of individual features as in (6). The candidate block
that maximizes (6) is selected as the best matched block and the corresponding
displacement $(x, y)$ becomes the motion vector of the block $(n; i, j)$.

$ODBS(i, j; x, y) = \sum_k w_k \cdot DBS(f_k; i, j; x, y)$ (6)



To determine the weights w_k, we use the entropy of the corresponding
feature f_k in the search area under consideration. We compute the weights as
the normalized entropies in (7), so that each weight lies between 0 and 1 and
the weights sum to 1. In (7), E_k denotes the entropy of the corresponding
feature and P(·) denotes the probability density of the feature value, which
is evaluated in the given search area.

$$
w_k = \frac{E_k}{\sum_{l} E_l}, \qquad
E_k = -\sum P(f_k(n; i, j)) \cdot \ln P(f_k(n; i, j)) \tag{7}
$$
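
The metric of (5) to (7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are plain nested lists indexed as frame[i][j], the names dbs, entropy_weights, and best_displacement are hypothetical, similarity is taken as one minus the normalized mean absolute difference so that it lies in [0, 1], and an exhaustive scan of a fixed window stands in for the adaptive search area.

```python
from collections import Counter
from math import log


def dbs(cur, prev, i, j, x, y, n, f_max):
    """Displaced block similarity of one feature map, in [0, 1] (cf. eq. 5)."""
    total = 0.0
    for u in range(n):
        for v in range(n):
            total += abs(cur[i + u][j + v] - prev[i + x + u][j + y + v])
    return 1.0 - total / (n * n * f_max)


def entropy_weights(feature_values):
    """Normalized entropies, one weight per feature (cf. eq. 7).

    feature_values holds, for each feature, the list of values observed
    in the search area (an assumed input layout).
    """
    entropies = []
    for values in feature_values:
        total = len(values)
        counts = Counter(values)
        entropies.append(-sum(c / total * log(c / total) for c in counts.values()))
    norm = sum(entropies)
    if norm == 0.0:  # all features constant: fall back to equal weights
        return [1.0 / len(entropies)] * len(entropies)
    return [e / norm for e in entropies]


def best_displacement(cur_maps, prev_maps, f_maxes, weights, i, j, n, search):
    """Scan displacements and return the one maximizing the weighted sum (cf. eq. 6)."""
    rows, cols = len(prev_maps[0]), len(prev_maps[0][0])
    best, best_score = None, -1.0
    for x in range(-search, search + 1):
        for y in range(-search, search + 1):
            inside = 0 <= i + x and i + x + n <= rows and 0 <= j + y and j + y + n <= cols
            if not inside:
                continue
            score = sum(w * dbs(c, p, i, j, x, y, n, m)
                        for w, c, p, m in zip(weights, cur_maps, prev_maps, f_maxes))
            if score > best_score:
                best, best_score = (x, y), score
    return best, best_score
```

Because a feature whose values are constant over the search area has zero entropy, it receives zero weight and cannot influence the match, which is the intended effect of weighting by normalized entropy.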



4 Experimental Results and Discussions 

In this section, we evaluate the performance of the proposed active block
matching algorithm in terms of the accuracy of the resulting motion vectors.
As features of each block, we used three different types: brightness,
gradient, and Laplacian.

Fig. 1 shows two adjacent frames in a sequence of test images. In this
sequence, frames are captured with camera operations that combine rotation by
two degrees per frame in a clockwise direction, translation by two pixels per
frame in a southeast direction, and zooming by a magnification of 1.05 per
frame. Fig. 2 depicts the motion vectors for the images of Fig. 1. Ideally,
the motion vectors should diverge in a spiral whose origin is a couple of
pixels off to the southeast. The motion vectors of the proposed method in
Fig. 2(d) follow this pattern more closely than those of the full search,
three-step search, and four-step search methods.

In this paper, we have presented an active block matching algorithm for 
motion estimation. Our algorithm dynamically determines the search area and 
the matching metric. Experimental results show that our algorithm outperforms 
other algorithms in terms of accuracy of the estimated motion vectors, though 
our algorithm requires some computational overhead. 

(a) Input image at t    (b) Input image at t + Δt

Fig. 1. Test images with multiple camera operations

(a) Full Search method    (b) Three Step Search method    (c) Four Step Search method    (d) Proposed method

Fig. 2. Estimated motion vectors

Abstract. Feature selection is used to improve performance of learning
algorithms by finding a minimal subset of relevant features. Since the process
of feature selection is computationally intensive, a trade-off between the quality
of the selected subset and the computation time is required. In this paper, we
are presenting a novel, anytime algorithm for feature selection, which gradually
improves the quality of results by increasing the computation time. The
algorithm is interruptible, i.e., it can be stopped at any time and provide a
partial subset of selected features. The quality of results is monitored by a new
measure: fuzzy information gain. The algorithm performance is evaluated on
several benchmark datasets.



Keywords: feature selection, anytime algorithms, information-theoretic
network, fuzzy information gain



1 Introduction 

A large number of potential features constitutes a serious obstacle to the
efficiency of most learning algorithms. Such popular methods as k-nearest
neighbors, C4.5, and backpropagation do not scale well in the presence of many
features. Moreover, some algorithms may be confused by irrelevant or noisy
attributes and construct poor classifiers. A successful choice of features
provided to a classifier can increase its accuracy, save computation time, and
simplify its results.

In practical applications, like data mining, there is no better solution than using the 
knowledge of a domain expert, who can identify manually all relevant predictors of a 
given variable. However, in many learning problems, such an expert is not available, 
and we have to use automated methods of feature selection that choose an optimal 
subset of features according to a given criterion. A detailed overview of feature 
selection methods is presented by Liu and Motoda (1998). 

Since classification accuracy is an important objective of learning algorithms, the
most straightforward method (called the wrapper model) is to evaluate each subset of
features by running a classifier and measuring its validation accuracy. Obviously, this
approach requires a considerable computational effort. Another approach (the filter
model) uses indirect performance measures (like information, distance, consistency,
etc.). The filter algorithms are computationally cheaper, but they evaluate
features in a random order, which makes their intermediate results hardly useful.

Whether the wrapper model or the filter model is applied to a set of features, the 
user may have to stop the execution of the algorithm, because there is no more time 
left for continuing the computation. Moreover, the time constraints may be unknown 
in advance and they can vary from seconds in real-time learning systems to hours or 
days in large-scale knowledge discovery projects. In both cases, we may be interested
in finding a good, but not necessarily the optimal, set of features as quickly as possible.
However, as appears from (Liu and Motoda, 1998), the existing methods of feature 
selection do not consider the trade-off between time and performance. 

Anytime algorithms (e.g., Dean and Boddy, 1988; Horvitz, 1987; Russell and
Wefald, 1991; Zilberstein, 1996) offer such a trade-off between the solution quality
and the computational requirements of the search process. The approach is known
under a variety of names, including flexible computation, resource-bounded
computation, just-in-time computing, imprecise computation, design-to-time
scheduling, and decision-theoretic metareasoning. All these methods attempt to find the
best answer possible given operational constraints. A formal model for anytime
algorithms is provided by $-calculus (Eberbach, 2000), a higher-order
polyadic process algebra with a utility (cost) that captures the bounded
optimization and metareasoning typical of distributed interactive AI systems.

In section 2, we describe the information-theoretic connectionist method of
feature selection, initially introduced by us in (Maimon, Kandel, and Last, 1999) and 
(Last and Maimon, 1999). This paper shows for the first time that the method is much 
faster than the wrapper techniques and it can be implemented as an anytime 
algorithm, when the computation time is limited. Section 3 reports initial experiments 
that study the performance of the information-theoretic method and suggests possible 
enhancements using $-calculus. Finally, in section 4 we summarize the benefits and 
the limitations of our approach and discuss some directions for future research in the 
field of resource-bounded feature selection. 



2 Information-Theoretic Method of Feature Selection 

Our method selects features by constructing an information-theoretic connectionist
network, which represents interactions between the predicting (input) attributes and
the classification (target) attributes. The minimum set of input attributes is chosen by
the algorithm from a set of candidate input attributes. The network construction 
procedure is outlined in sub-section 2.1. The theoretical properties of the algorithm in 
the context of anytime computation are evaluated in sub-section 2.2. 



2.1 Network Construction Algorithm 

An information-theoretic network is constructed for each target attribute separately. It
consists of the root node, a changeable number of hidden layers (one layer for each
input attribute), and a target layer. Each hidden (target) layer consists of nodes
representing different values of an input (target) attribute. The network differs from
the structure of a standard decision tree (see Quinlan, 1986 and 1993) in two aspects:
it is restricted to the same input attribute at all the nodes of each hidden layer and it
has interconnections between the terminal (unsplit) nodes and the final nodes,
representing the values of the target attribute.

The network construction algorithm starts with a single-node network representing
an empty set of input attributes. A node is split if the split provides a statistically
significant decrease in the conditional entropy of the target attribute (based on a
predefined significance level). A new input attribute is selected to maximize the total
significant decrease in the conditional entropy. The nodes of a new hidden layer are
defined for the Cartesian product of the split nodes of the previous hidden layer and
the values of the new input attribute. If there is no candidate input attribute that
significantly decreases the conditional entropy of the target attribute, the construction
stops. Detailed descriptions of the algorithm steps are provided in (Maimon, Kandel,
and Last, 1999) and (Last and Maimon, 1999).
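
The greedy, layer-by-layer selection described above can be illustrated with a simplified Python sketch. It is not the published algorithm: it scores a candidate by the decrease in the conditional entropy of the target over the whole dataset rather than node by node, and it replaces the statistical significance test with a fixed threshold (the hypothetical parameter min_gain). Records are assumed to be dictionaries of discrete values, and the function names are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(rows, target, attrs):
    """Estimate H(target | attrs) from value counts in rows."""
    groups = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def greedy_select(rows, target, candidates, min_gain=0.01):
    """Add one attribute per iteration while the entropy decrease is large enough.

    min_gain is an assumed stand-in for the significance test of the paper.
    Returns the selected attributes and the (attribute, gain) history.
    """
    selected, history = [], []
    remaining = list(candidates)
    current = conditional_entropy(rows, target, selected)
    while remaining:
        best = min(remaining,
                   key=lambda a: conditional_entropy(rows, target, selected + [a]))
        new = conditional_entropy(rows, target, selected + [best])
        gain = current - new  # conditional mutual information of this step
        if gain < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        history.append((best, gain))
        current = new
    return selected, history
```

Each loop iteration corresponds to adding one hidden layer; the per-step gains in history sum to the mutual information between the target and the selected attributes.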

An example of an information-theoretic connectionist network, which has three 
hidden layers (related to three selected attributes), is shown in Fig. 1. The 
performance of the algorithm is evaluated in Section 3 below. 



Fig. 1. Information-Theoretic Network: Credit Dataset. Layer 1 (Other
investments), Layer 2 (Balance), Layer 3 (Bank Account), Target layer (Class).



2.2 Anytime Properties of the Information-Theoretic Algorithm 

According to Zilberstein (1996), the desired properties of anytime algorithms include
the following: measurable solution quality, recognizable quality (which can be easily
determined at run time), monotonicity (quality is a non-decreasing function of time),
consistency of the quality w.r.t. computation time and input quality, diminishing
returns of the quality over time, interruptibility of the algorithm (hence the term
anytime), and preemptability with minimal overhead. Thus, measuring the quality of
the intermediate results is the key concept of anytime algorithms.

To represent the automated perception of the network quality, we will use here a 
new measure, called fuzzy information gain, which is defined as follows: 

$$
FGAIN = \frac{2}{1 + e^{\beta \alpha H(A_i / I) / MI(A_i; I)}} \tag{1}
$$



where

H(A_i / I) - estimated conditional entropy of the target attribute A_i, given the set of
input attributes I.

MI(A_i; I) - estimated mutual information between the target attribute A_i and the
set of input attributes I.

α - significance level used by the algorithm.

β - scaling factor, representing the perceived utility ratio between the significance
level and the estimated mutual information. The meaning of different values of β is
demonstrated in Fig. 2. The shape of FGAIN(MI) varies from a step function for low
values of β (about 1) to an almost linear function when β becomes much higher (about
500). Thus, β can be used to represent the level of user-specific quality requirements.
A general fuzzy-theoretic approach to automating the human perception of data is
described in (Last and Kandel, 1999).




Fig. 2. Fuzzy Information Gain as a function of MI, for three different values of β.

Interpretation. FGAIN is defined above as a continuous monotonic function of
three parameters: α, H(A_i / I), and MI(A_i; I). It is non-increasing in the significance
level α, because lower α means higher confidence and, consequently, higher quality.
In the ideal case α = 0, which implies that FGAIN is equal to one. FGAIN is also
non-increasing in the conditional entropy H(A_i / I), because lower conditional entropy
represents lower uncertainty of the target attribute, given the values of the input
attributes. If the target attribute is known perfectly (H(A_i / I) = 0), FGAIN obtains
the highest value (one). On the other hand, FGAIN is non-decreasing in the mutual
information MI(A_i; I) that represents the decrease in the uncertainty of the target.
When MI(A_i; I) becomes very close to zero, FGAIN becomes exponentially small.
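
The behaviour described in this interpretation can be checked numerically with a short Python sketch. It assumes the logistic form FGAIN = 2 / (1 + exp(β α H / MI)), which matches the limiting cases listed above; the function name fgain, the argument names, and the convention of returning 0 when MI is not positive are illustrative assumptions.

```python
import math


def fgain(alpha, cond_entropy, mutual_info, beta):
    """Fuzzy information gain under the assumed logistic form.

    alpha: significance level, cond_entropy: H(target | inputs),
    mutual_info: MI(target; inputs), beta: scaling factor.
    """
    if mutual_info <= 0.0:
        return 0.0  # assumed convention for the limit MI -> 0
    exponent = beta * alpha * cond_entropy / mutual_info
    if exponent > 700.0:  # avoid overflow in exp; the value is effectively zero
        return 0.0
    return 2.0 / (1.0 + math.exp(exponent))


if __name__ == "__main__":
    # Compare a step-like profile (beta = 1) with a gradual one (beta = 500).
    for beta in (1, 500):
        values = [round(fgain(0.001, 0.5, mi, beta), 3) for mi in (0.01, 0.1, 0.5)]
        print(beta, values)
```

With a small β the value stays near one for almost every positive MI, which gives the step-like shape, while a large β spreads the transition over a wide range of MI.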

Now we need to verify that our method of feature selection has the desired 
properties of anytime algorithms, as defined by Zilberstein (1996). The conformity 
with each property is checked below. 

• Measurable quality. According to equation (1), the fuzzy information gain
can be calculated directly from the values of the conditional entropy and the
mutual information after each iteration of the algorithm.

• Recognizable quality. In (Last and Maimon, 1999), we have shown that the
mutual information can be calculated incrementally by adding the
conditional mutual information of each step to the mutual information at the
previous step. This makes the determination of FGAIN fast.

• Monotonicity. A new attribute is added to the set of input attributes only if
it causes an increase in the mutual information. This means that the mutual
information is a non-decreasing function of run time. Since one can easily
verify that the fuzzy information gain is a monotonic non-decreasing
function of MI, the monotonicity of the quality is guaranteed.

• Consistency. The theoretical run time of the algorithm has been shown by us 
in (Last and Maimon, 1999) to be quadratic-logarithmic in the number of 
records and quadratic polynomial in the number of initial candidate input 
attributes. In the next section, we are going to analyze experimentally the 
performance profile of the algorithm on datasets of varying size and quality. 

• Diminishing returns. This property is very important for the algorithm's
practical usefulness: it means that after a small part of the running session,
the results are expected to be sufficiently close to the results at completion
time. We could prove this property mathematically if we could show that
the mutual information is a concave function of the number of input
attributes. Though this proposition does not hold in the general case, it is
possible to conclude from Fano's inequality (see Cover, 1991) that the
mutual information is bounded by a function that behaves this way. This
conclusion is confirmed by the results of the next section.

• Interruptibility. The algorithm can be stopped at any time and provide the
current list of selected attributes. Each iteration forms what is called a
contract anytime algorithm, i.e., the corrections of FGAIN are available only
after termination of an iteration.

• Preemptability. Since the algorithm maintains the training data, the list of 
input attributes, and the structure of the information-theoretic network, it can 
be easily resumed after an interrupt. If the suspension is expected to be long, 
all the relevant information may be stored in files on a hard disk. 
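
The interruptibility and preemptability properties above can be illustrated with a small Python sketch. It is a generic driver under stated assumptions, not the authors' code: the selection procedure is modelled as an iterator yielding one (attribute, quality) pair per completed iteration, and the names run_anytime, steps, and deadline_seconds are hypothetical.

```python
import time


def run_anytime(steps, deadline_seconds, clock=time.monotonic):
    """Consume completed iterations until the iterator ends or the deadline passes.

    steps yields (attribute, quality) after each finished iteration, so the
    deadline is only checked between iterations (contract behaviour).
    Returns the attributes selected in this call and the last known quality.
    """
    start = clock()
    selected, quality = [], 0.0
    for attribute, new_quality in steps:
        selected.append(attribute)
        quality = new_quality
        if clock() - start >= deadline_seconds:
            break  # interrupted: the partial result is still usable
    return selected, quality


if __name__ == "__main__":
    # Hypothetical quality values; the iterator keeps its state between calls,
    # so a second call resumes where the first one stopped.
    steps = iter([("a", 0.50), ("b", 0.70), ("c", 0.75)])
    print(run_anytime(steps, deadline_seconds=0.0))
    print(run_anytime(steps, deadline_seconds=10.0))
```

Checking the clock only after an iteration completes mirrors the statement that updated FGAIN values become available only when an iteration terminates.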



3 Experimental Results 

According to Zilberstein (1996), the performance profile (PP) of an anytime algorithm
denotes the expected output quality as a function of the execution time t. To study the
performance profile of the information-theoretic method for feature selection, we
have applied it to several benchmark datasets, available from the UCI Machine
Learning Repository (Blake and Merz, 1998). Rather than measuring the absolute
execution time of the algorithm on every dataset, we have normalized it with respect
to the completion time, which is the minimal time at which the expected quality is
maximal (Zilberstein, 1993). Obviously, this relative time is almost independent of
the hardware platform used for running the algorithm.

Fig. 3. Performance profile of the information-theoretic algorithm (horizontal axis: relative time)

We have used seven datasets for our analysis (see Table 1), but in two datasets
(Breast and Iris), the run time was too short to be detectable by the computer system
(Pentium II, 400 MHz). Thus, we are presenting in Fig. 3 performance profiles for
five datasets only. Two important observations can be made from this chart. First, 
we can see that FGAIN is a non-decreasing function of execution time. The second 
observation is about the diminishing returns: except for the Chess dataset, the 
performance profiles are concave functions of time. We have explained the 
theoretical background of this result in sub-section 2.2 above. 

The number of selected features in each dataset and the absolute execution times
are shown in Table 1. The size of the datasets varies between 150 and 3,196 cases.
The total number of candidate input attributes is up to 36, including nominal and
continuous features. On average, less than 30% of the attributes have been selected
by the algorithm when it was run to its termination. The completion time ranges from
undetectable (less than 0.1 sec.) up to 1.65 sec. for the Diabetes dataset,
which has 768 records and 8 continuous attributes. These times are significantly
lower than the execution times of a wrapper selector, which may vary between 16 sec.
and several minutes for datasets of similar size (see Liu and Motoda, 1998).

Another question is how useful the selected features are for the classification task.
The selected features can be considered useful if a classifier's accuracy remains at
approximately the same level. To verify this assumption, we have partitioned each
dataset into training and validation records, keeping the standard 2/3 : 1/3 ratio (Liu 
and Motoda, 1998). The C4.5 algorithm (Quinlan, 1993) has been trained on each 
dataset two times: before and after feature selection. The error rate of both models has 
been measured on the same validation set. The minimum and the maximum error 
rates have been calculated for a 95% confidence interval. As one can see from Table 
2, the error rate of C4.5 after feature selection is not significantly different from its 
error rate with all the available features. Moreover, it tends to be slightly lower after 
applying the feature selection algorithm. One exception is the Chess dataset, where 
the error rate has increased beyond the upper bound of the confidence interval. Due 
to the feature selection procedure, the stability of the error rate is accompanied, in 
most datasets, by a considerable reduction in the size of the decision tree model 
(measured by the number of tree nodes). 
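
The confidence bounds in Table 2 can be approximated with a few lines of Python. The sketch assumes the usual normal approximation to the binomial distribution, which the text does not state explicitly; it reproduces the bounds of rows with a non-zero error rate (for example Breast) but not the upper bound reported for Iris, where the observed error is zero. The function name and the error count of 11 (inferred from 5.4% of 204 validation items) are illustrative assumptions.

```python
from math import sqrt


def error_rate_interval(errors, n, z=1.96):
    """Approximate 95% interval for an error rate (normal approximation).

    errors: number of misclassified validation items, n: validation set size.
    """
    p = errors / n
    half_width = z * sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    low, high = error_rate_interval(11, 204)  # Breast dataset, before feature selection
    print(f"{low:.1%} to {high:.1%}")
```

An error rate measured after feature selection that falls inside this interval is then treated as not significantly different from the rate obtained with all features.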

The novelty of our approach is that it allows capturing the trade-off between the 
solution quality and the time saved and/or complexity of classification represented by 
the number of input attributes. This can be crucial for classification algorithms 
working with a large number of input attributes, or with real time constraints. 
Alternative quality measures and costs of meta-reasoning can be studied in the 
process algebra framework provided by $-calculus (Eberbach, 2000) which formalizes 
anytime algorithms. For example, in terms of $-calculus, the trade-off
between the solution quality and the time can be expressed as a new measure,
(1 - t) FGAIN, where t is a normalized execution time (assuming that the
execution time is bounded), or alternatively out in FGAIN can be modified.



Table 1. Feature selection: summary of results

Dataset    Data   Classes  Continuous  Nominal  Total       Selected    Completion
           Size                                 Attributes  Attributes  Time (sec.)
Breast      699   2        9            0        9          3           -
Chess      3196   2        0           36       36          9           0.28
Credit      690   2        6            8       14          3           1.04
Diabetes    768   2        8            0        8          4           1.65
Glass       214   6        9            0        9          3           0.61
Heart       297   2        6            7       13          3           0.22
Iris        150   3        4            0        4          1           -
Mean        859   3        6            7       13.3        3.6         0.76




Table 2. Error rate and tree size of C4.5 before and after feature selection

           Validation  Before F.S.                              After F.S.
Dataset    Items       Tree Size  Error Rate  Min.    Max.     Tree Size  Error Rate
Breast      204        29          5.4%        2.3%    8.5%    19          4.9%
Chess      1025        45          1.3%        0.6%    2.0%    29          3.0%
Credit      242        26         14.5%       10.0%   18.9%     3         14.0%
Diabetes    236        63         28.4%       22.6%   34.1%    23         23.3%
Glass        71        39         36.6%       25.4%   47.8%    39         33.8%
Heart        93        33         19.4%       11.3%   27.4%    16         24.7%
Iris         49         9          0.0%        0.0%    9.5%     5          2.0%




4 Summary 

In this paper, we have presented a novel algorithm for feature selection, which can be 
interrupted at any time and provide us with a partial set of selected features. The 
quality of the algorithm results is evaluated by a new measure, the fuzzy information 
gain, which represents the user perception of the model quality. The performance 
profile of the algorithm has been shown to be a non-decreasing and mostly concave 
function of execution time. The quality of the final output has been confirmed by 
applying a data mining algorithm (C4.5) to a set of selected features. 

Topics for future research include consideration of alternative quality measures,
predicting expected quality for a given run time (and vice versa), and integrating
anytime feature selection with real-time learning systems.

Abstract. Conventional research on image restoration has focused
on restoring blurred images to sharp images using frequency filtering or
video coding for transferring images.

In this paper, we propose a method for recovering original images using
camera motion and video information such as caption regions and scene
changes. The method decides the direction of recovery using the caption
information and scene change information. According to the direction
of recovery, a rough estimate of the direction and position of the original
image is obtained using the motion vector calculated from the camera motion.
Because the camera motion does not reflect local object motion, some
distortion can occur in the recovered image. To solve this problem, a
block matching algorithm is applied in units of caption character
components at the obtained recovery positions. Experimental results
show that images with little motion are recovered well. Images with
motion against a complex background are also recovered.



1 Introduction 

Captions are frequently inserted into broadcast images or video images to
aid the audience's understanding. For images that have already been broadcast,
it is sometimes necessary to remove the captions and recover the original
images. When the number of images requiring such recovery is small, manual
processing is possible, but as the number grows it becomes very difficult to
do manually. Therefore, a method for recovering the original image data for
the caption areas is needed. Research on image restoration has focused on
restoring blurred images to sharp images using frequency filtering [1] or
video coding for transferring images [2]. Other research includes the recovery
of cultural heritage using interpolation [3]. That method is based on lines
and is therefore not suitable for larger caption areas. Restoration methods
using BMA (Block Matching Algorithm) [4] rely on simple comparison with
previous frames, so errors can propagate.

2 Prior-information extraction for reconstructing
original images

To reconstruct the lost region occupied by caption areas, we extract prior
information from the video: caption area information, cut detection
information, and camera motion information.

The caption information consists of the start frame and the end frame of
the caption and the character components of the extracted caption areas. The
caption extraction method [5] we use is based on graph-theoretic clustering.
We can decide the direction and the starting point of recovery using the
start frame and the end frame. The extracted character components are used
as the basic units for recovery.

The information for cut detection is used to decide the direction and the
ending point of recovery. We use the method of [6] for detecting cuts in video.
This method detects cuts using the motion vector (Mv), which consists of a
magnitude (M) and a direction (D).

The information for camera motion is used to decide the location of the
reference original image. We use a method based on [7] to extract camera
motion information from the video. The camera motions are classified
into Pr (panning right), Pl (panning left), Tu (tilting up), Td (tilting down), and
Zm (zooming).



3 Reconstruction of the lost region occupied by caption 
areas 

Because the frame just before the caption appears and the frame just after
the caption disappears contain the original image, we find the positions of the
start frame (ST_fri) and the end frame (ED_fri), and use these frames as the
basis for recovery.

The direction of recovery determines in which direction recovery proceeds,
starting from the start frame or the end frame. We decide the end points of
recovery in relation to the information for cut detection. There are three cases.

In the first case, where there is no scene change, recovery proceeds from the
start frame (ST_fri) to the middle frame of the caption and from the end frame
(ED_fri) to the middle frame.

In the second case, where there is one scene change, recovery proceeds from the
start frame (ST_fri) to the scene change frame and from the end frame (ED_fri)
to the scene change frame.

In the third case, where there are two or more scene changes, recovery proceeds
from the start frame (ST_fri) to the first scene change in the forward
direction, and from the end frame (ED_fri) to the last scene change in the
reverse direction. In this paper, we do not process the frames between scene
changes. Because the frames between the first and second scene changes have no
original image for reference and the character region to be recovered is too
large, neither the traditional method nor our method can recover the original
image. Therefore, these frames should be processed by another method.
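
The three cases above can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the authors' procedure: frames are integer indices, a scene change index marks the first frame of the new shot, the split at the middle frame in the first case is an arbitrary convention, and the name plan_recovery is hypothetical.

```python
def plan_recovery(start, end, scene_changes):
    """Plan the recovery order for one caption.

    start, end: first and last frame index of the caption.
    scene_changes: frame indices at which a new shot begins.
    Returns (forward, backward, skipped): forward is (first, last) processed in
    increasing order, backward is (first, last) processed in decreasing order,
    skipped is the inclusive range left unprocessed, or None.
    """
    cuts = sorted(c for c in scene_changes if start < c <= end)
    if not cuts:
        # Case 1: no scene change, meet in the middle of the caption.
        middle = (start + end) // 2
        return (start, middle), (end, middle + 1), None
    # Cases 2 and 3: stop at the first cut going forward and at the last cut
    # going backward.
    forward = (start, cuts[0] - 1)
    backward = (end, cuts[-1])
    # With two or more cuts, the frames between them have no reference image.
    skipped = (cuts[0], cuts[-1] - 1) if len(cuts) > 1 else None
    return forward, backward, skipped


if __name__ == "__main__":
    print(plan_recovery(10, 20, []))
    print(plan_recovery(10, 20, [14]))
    print(plan_recovery(10, 20, [13, 17]))
```

Returning the skipped range explicitly makes it easy to hand those frames to a different recovery method.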

We recover the original image for the caption area by using the extracted
video information (caption information and scene change information) and the
camera motion information. The character components are extracted from the
extracted caption areas as shown in Fig. 1. They are used as the basic units
for recovery. The camera motion information gives, for each frame, the general
type of camera motion and the motion vector, i.e. the motion direction and
magnitude, as shown in Fig. 2. Here, No stands for no camera operation.




Fig. 1. Extracting character components in extracted caption region

Fig. 2. Camera motion and direction information for recovery



To recover the original image in a caption area, we first find where each
character component in the start frame (ST_fri) is located in the original
image. The caption areas are then replaced by the original areas that were
found. We recover the current frame (ST_fri) by applying this step to all
character components. The recovered caption area is then used for recovering
the next frame (ST_fri+1). Once recovery from the start frame to RED_1 is
finished, recovery proceeds from the end frame (ED_fri) to RED_2 in the
reverse direction.

How, then, can we find where the character components are located in the
original image? We may find this using the camera motion information. However,
because the camera motion is determined from the whole image, it cannot
reflect local object movements, which causes distortion in recovery. To solve
this problem, we use BMA. Block matching is very popular and is widely used for
motion compensation. Our matching criterion is the MAD (Minimum Absolute
Difference) criterion, defined in formula (1).

Block matching compares the contour pixels of a character component with the
original image, which has no caption, and finds the position with the minimum
color distance to the original image. Once this position is found, the color
of the recovered caption region is taken from it.



$$
MAD(d_1, d_2) = \sum_{(x, y) \in R} \left| F(x, y, t) - F(x + d_1, y + d_2, t - 1) \right| \tag{1}
$$

(Here F is a frame in the video sequence, R specifies the contour pixels of the
character component for which the translation vector has to be calculated, and
|d_1|, |d_2| < the search range.)

When there is camera motion, the values of d1 and d2 are determined using the
motion vector obtained from each frame. We allow for some deviation of the
motion vectors and adjust the values as d1 (= d1 ± 1) and d2 (= d2 ± 1). When
there is no camera motion, we use the default values d1 (= 16) and d2 (= 16).
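
The matching step of formula (1) can be sketched in Python as follows. This is a minimal illustration, not the authors' Visual C++ implementation: frames are assumed to be grayscale nested lists indexed as frame[y][x] (the paper speaks of color distance), the contour is a list of (x, y) pixels, and the names mad and refine_offset, as well as the slack parameter modelling the ±1 adjustment, are hypothetical.

```python
def mad(cur, ref, contour, d1, d2):
    """Sum of absolute differences over the contour pixels for offset (d1, d2)."""
    return sum(abs(cur[y][x] - ref[y + d2][x + d1]) for x, y in contour)


def refine_offset(cur, ref, contour, d1, d2, slack=1):
    """Search around the camera-motion estimate (d1, d2) for the best offset.

    Candidates that would move a contour pixel outside the reference frame
    are skipped. Returns ((best_d1, best_d2), cost), or (None, None) if no
    candidate fits inside the frame.
    """
    height, width = len(ref), len(ref[0])
    best, best_cost = None, None
    for a in range(d1 - slack, d1 + slack + 1):
        for b in range(d2 - slack, d2 + slack + 1):
            outside = any(not (0 <= x + a < width and 0 <= y + b < height)
                          for x, y in contour)
            if outside:
                continue
            cost = mad(cur, ref, contour, a, b)
            if best_cost is None or cost < best_cost:
                best, best_cost = (a, b), cost
    return best, best_cost
```

The colors for the recovered character component would then be copied from the reference frame at the returned offset.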

4 Experiment and Results 

The experiments were performed on a Pentium PC with a 550 MHz CPU. The
program is implemented in Visual C++ Ver. 6.0. MPEG-1 data were used for the
experiments. The video used is a movie, as shown in Fig. 3. Fig. 3(a) shows a
sequence of the original images and Fig. 3(b) shows the recovered images.
Experimental results show that images with little motion are recovered well.
Images with motion against a complex background are also recovered. Using
camera motion information together with video information gives more accurate
recovery. When the movement of objects is sudden or large, some distortion
appears in the recovered image.

5 Conclusion 

The experiments show that stationary images and images with little motion
are recovered well, but images with a lot of movement against a complex
background show some distortion. Therefore, the following issues should be
studied further. First, a more sophisticated recovery method is needed for
processing images with large and complex motions, such as action movies.
Second, for recovery when two or more scene changes occur within the caption
region, a method using panorama technology will be needed. Finally, for
dissolve captions, which are intended to give the audience a smooth
impression, a method of interpolation by frame interval control is needed.

Fig. 3. A sequence of reconstructed original images in movie video

Abstract. Contemporary digital video, film, and multimedia presentations are often
accompanied by surround sound. The techniques and standards involved in digital video
processing are much more developed than the concepts underlying the creation, recording,
and mixing of multichannel sound. The main challenge of sound processing in a
multichannel system is to create an appropriate basis for relating the multimodal
context of the visual and sound domains. Therefore, one of the purposes of the
experiments is to study in which way and how the surround sound interferes or is
associated with the visual context. This kind of study has hitherto been carried
out for two-channel sound associated with stereo TV. However, few studies so far
associate surround sound with digital video presented on a TV screen. The main
issue in such experiments is the analysis of the influence of visual cues on the
perception of the surround sound. This problem will be addressed by applying fuzzy
logic to the processing of subjective test results.



1 Introduction 

There are many scientific reports showing that human perception of sound is
affected by image and vice versa. For example, Stratton, in his experiments carried out
at the end of the 19th century, proved that visual cues can influence the directional
perception of sound. This conclusion was confirmed by Klemm [1], Held [2] and others.
Gardner experimentally demonstrated how the image can affect the perceived distance
between the sound source and the listener [3]. The phenomenon of interference between
audio and video stimuli was also reported by Thomas, Witkin, Wapner and Leventhal
[4]. Very important experiments demonstrating the interaction between audio and video in
stereo TV were made by Brook, Danilenko, Strasser [5] and Wladyka [6]. However,
there is still no clear answer to the question of how video influences the localization
of virtual sound sources in multichannel surround systems (e.g. DTS). Therefore, there
is a need for systematic research in this area, especially as sound and video engineers
seek such information in order to optimize the surround sound. The results of this kind
of research may improve the production of movie soundtracks and the recording of music
events and live transmissions, so that the resulting surround sound may seem more
natural to the listener. The experiments are based on subjective testing of a group of
people, so-called experts, listening to the sound with and without vision. The obtained
results are processed in order to find hidden relations underlying the influence of video
on the perception of audio, particularly with regard to the influence of video on the
directivity of localization of sound sources in the surrounding acoustical space. Some
soft computing methods could be used for processing the subjective test results, bringing
better results of the analysis than statistical methods, particularly if the number of
tests and involved experts is reasonably small. An approach to such an application is
formulated in the paper. The proposed method of analysis of subjective opinion scores
could also be used in domains other than audio-video perception investigation (public
opinion analysis, etc.).



2 Experimental Background 



Results of such experiments may show in which cases and in what way the video 
can affect the localization of virtual sound sources. In most cases the video "attracts" 
the attention of the listener and, as a consequence, he or she localizes the sound closer 
to the screen center. This effect can therefore be called the audio-visual proximity effect. 

In the experiments two acoustically separated rooms are used: an auditory room and a 
control room (Fig. 1). A window between the two rooms allows video to be projected from 
the control room into the auditory room. A view of the auditory room is presented in 
Fig. 2. The listener is seated in the so-called "sweet spot" (the best place for listening). 




[Fig. 1 diagram; labels: projector, amplifier with Dolby Digital decoder, computer with DVD player, video, audio, CONTROL ROOM] 

Fig. 1. Setup used during experiments 




During the tests, AC-3 (Dolby Digital) encoded audio files and MPEG-2 encoded video 
files were used. Sound files were prepared in the Samplitude 2496 application and then 
exported to the AC-3 encoder. The following equipment was used during the tests: 
a computer with a built-in DVD player, an amplifier with a Dolby Digital decoder, a video 
projector, a screen (dimensions: 3x2 m) and loudspeakers. 
Preliminary Listening Tests 

In the preliminary experiments the arrangement of loudspeakers was as follows: 
four loudspeakers were aligned along the left-hand side of the screen (Fig. 3). In this 
case, the first loudspeaker was placed at the edge of the room, whereas the fourth one 
was positioned under the screen. This arrangement of loudspeakers made it possible to 
examine how the visual object can affect the angle of the subjectively perceived sound 
source. 

Fig. 3. Arrangement of loudspeakers during the tests 



The experiment scenario was as follows. In the first phase of the experiment white 
noise was presented from the loudspeakers in random order. The expert's task was 
to determine from which loudspeaker the sound was heard. Then, in the second phase of 
the experiment, a blinking object was displayed in the center of the screen with 
synchronously generated white noise. In the center of the circle a one-digit number was 
displayed. Each time the circle was displayed the number was changed, in order to draw 
the attention of the listener to the picture. The obtained results show that the image 
proximity effect is speaker dependent; however, most experts' results clearly demonstrate 
the mentioned effect. The most prominent data showing this effect are shown in Fig. 4. 
The shift in the direction of the centrally located loudspeaker is clearly visible. 




[Fig. 4 plots; axis label: physical source (No. of the loudspeaker)] 

Fig. 4. Comparison of answers of the expert S.Z. for two types of experiments: 
without video / with video 
3 Fuzzy Logic Processing of Subjective Results of Surround Sound 
Directivity Testing 

The subjective tests presented below aimed at finding a relation between precise 
surround directivity angles and semantic descriptors of horizontal plane directions. 
One can hardly expect an expert to be exact in localizing phantom sources in the surround 
stereophonic base and to provide precise values of angles. On the other hand, it seems 
quite natural that an expert will localize a sound using such directional descriptors as: 
left, left-front, front, right-front, right, rear-right, rear, rear-left. Thus, the first 
series of the experiment should consist in mapping these descriptors to angles as in Fig. 5. 




Fig. 5. Questionnaire form used in the first stage of experiments 

In this step of the investigations, sound samples recorded in the anechoic chamber 
should be presented to the group of experts. While listening to the sound excerpts, the 
experts are instructed to rate their judgements of the performance using the descriptors 
introduced above. In order to obtain statistically validated results, various sound 
excerpts should be presented to a sufficiently large number of experts. This procedure 
is based on the concept of the Fuzzy Quantization Method (FQM) applied to the acoustical 
domain [7], [8]. Since the experimenter knows to what angle a given sound was assigned, 
this stage of the experiments will result in mapping the semantic descriptors received 
from the experts to particular angles describing the horizontal plane. 

In order to simplify this phase of tests, the localization sphere should be divided into 
5° steps. Fig. 6 shows an exemplary mapping of the front membership function. All other 
membership functions should be estimated in a similar way (see Fig. 7). 




Fig. 6. Experts’ votes for the front membership function, N - number of experts voting 
for particular values of localization (variable: angle) 
As shown in Figs. 6 and 7, the distribution of the observed instances may suggest a 
typical trapezoidal shape of a membership function. In the next step of the analysis, such 
membership functions should be identified with the use of some statistical methods. 
This can be done by several techniques. The most common technique is linear 
approximation, where the original data range is transformed to the interval [0,1]. Thus, 
triangular or trapezoidal membership functions may be used in this case. In the linear 
regression method, one assigns minimum and maximum attribute values. Assuming 
that the distribution of parameters provides a triangular membership function for the 
estimated parameter, the maximum value may thus be assigned as the average value of 
the obtained results. This may, however, cause a loss of information and bad 
convergence. The second technique uses bell-shaped functions. The initial values of 
parameters can be derived from the statistics of the input data. Further, the polynomial 
approximation of data, either ordinary or Chebyshev, may be used. This technique is 
justified by a sufficiently large number of results or by increasing the order of 
polynomials; however, the latter direction may lead to a weak generalization of results. 
Another approach to defining the shape of the membership function involves the use 
of the probability density function. The last mentioned technique was discussed in the 
given context more thoroughly in the literature [7]. 




Fig. 7. Directivity membership functions on the horizontal plane 

Intuitively, it seems appropriate to build the initial membership function by using 
the probability density function and by assuming that the parameter distribution is 
trapezoidal or triangular. The estimation of the observed relationships is given by the 
function shown in Fig. 8. 

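The $f_1$ membership function from Fig. 8 is defined by a set of parameters A, a, b, c, 
d and is determined as follows: 

$$
f_1(x, A, a, b, c, d) =
\begin{cases}
0 & \text{if } x < a \text{ or } x > d \\
A(x-a)/(b-a) & \text{if } a \le x < b \\
A & \text{if } b \le x \le c \\
-A(x-d)/(d-c) & \text{if } c < x \le d
\end{cases}
\tag{1}
$$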
[Fig. 8 plot; vertical axis: f_1(x, A, a, b, c, d)] 

Fig. 8. Trapezoidal membership function estimated by the probability density function 
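
The piecewise definition in Eq. (1) can be sketched in Python as follows. This is only an illustration: the function name `trapezoid` and the parameter values chosen for a "front" descriptor are hypothetical and are not taken from the experiments.

```python
def trapezoid(x, A, a, b, c, d):
    """Trapezoidal membership function of Eq. (1); assumes a < b <= c < d."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return A * (x - a) / (b - a)      # rising edge
    if x <= c:
        return A                          # plateau
    return A * (d - x) / (d - c)          # falling edge, same as -A(x-d)/(d-c)

# Hypothetical parameters for a "front" descriptor, in degrees.
front = dict(A=1.0, a=-40.0, b=-15.0, c=15.0, d=40.0)
for angle in (-50.0, -27.5, 0.0, 27.5, 40.0):
    print(angle, trapezoid(angle, **front))
```

The falling edge is written as A(d - x)/(d - c), which is algebraically identical to the last case of Eq. (1) but avoids returning a negative zero at x = d.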

The mth moment of the probability density for the function $f_1(x, A, a, b, c, d)$ 
is calculated as follows: 

$$ M_m = \int_{-\infty}^{+\infty} x^m f_1(x, A, a, b, c, d)\, dx \tag{2} $$

The estimate of the mth moment of the probability density function from the test 
(assuming that all observation instances fall into the intervals j, where j = 1, 2, ..., k) 
is calculated according to the formula: 

$$ \hat{M}_m = \sum_{j=1}^{k} x_j^m \, P(x = x_j) \tag{3} $$

where $P(x = x_j)$ represents the probability that the attribute value of instance x falls 
into the interval j. 

Next, the subsequent statistical moments of order from 0 to 4 for this function 
should be calculated. Then, by substituting the observed values into Eq. (3), the 
consecutive values of $\hat{M}_m$ are calculated. From this, the set of 5 linear equations 
with 5 unknown variables A, a, b, c, d should be determined. After numerically solving 
this set of equations, the final task of the analysis will be validation of the observed 
results using Pearson's χ² test with k-1 degrees of freedom [7]. 

Using the above outlined statistical method, a set of fuzzy membership functions 
for the studied subjective sound directivity can be estimated. 
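
As a rough illustration of the moment estimate referred to as Eq. (3), the sketch below computes the moments of order 0 to 4 from a histogram of expert votes, taking P(x = x_j) as the relative frequency of votes in interval j. The vote counts and the 5° grid values are hypothetical, and the function name `empirical_moment` is introduced here only for the example.

```python
def empirical_moment(centers, counts, m):
    """Estimate of the m-th moment: sum over j of x_j**m * P(x = x_j)."""
    total = sum(counts)
    return sum((x ** m) * (n / total) for x, n in zip(centers, counts))

# Hypothetical votes for the "front" descriptor on a 5-degree grid.
centers = [-10, -5, 0, 5, 10]   # interval centers in degrees
counts = [1, 3, 4, 3, 1]        # number of experts voting for each interval

moments = [empirical_moment(centers, counts, m) for m in range(5)]
print(moments)
```

The five estimated moments would then be equated with the corresponding moments of the trapezoid to obtain the system in A, a, b, c, d described above; solving that system is not shown here.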



4 Inter-Modal Testing Phase 

In order to proceed with testing the inter-modal relation between sound localization 
and video images, another questionnaire should be used. This time, the experts' task 
would be to assign a crisp angle value to the incoming sound excerpt while watching 
a TV screen. Having previously estimated the membership functions, it would then be 
possible to check whether the observation of video images can change sound localization 
and, if so, to what degree. This can be done by performing a fuzzification 
process. The data representing the actual listening tests would then pass through the 
fuzzification operation, in which degrees of membership are assigned to each 
crisp input value. 






The process of fuzzification is illustrated in Fig. 9. The pointers visible in this 
figure refer to the degrees of membership for the precise value of the localization angle 
335°. This value belongs to the left-rear fuzzy set with the degree of 0, to the 
left-front fuzzy set with the degree of 0.65 and to the front fuzzy set with the 
degree of 0.35. The same procedure should be applied to every sound-image instance. 
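
The fuzzification step can be sketched as below. The trapezoid parameters are hypothetical: they were chosen only so that the angle 335° reproduces the degrees quoted above (0, 0.65 and 0.35), and they are not the membership functions estimated in the experiments.

```python
def trapezoid(x, A, a, b, c, d):
    """Trapezoidal membership function; assumes a < b <= c < d."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return A * (x - a) / (b - a)
    if x <= c:
        return A
    return A * (d - x) / (d - c)

# Hypothetical membership functions (A, a, b, c, d), angles in degrees.
# "front" is allowed to run past 360 so that it can straddle the 0/360 boundary;
# this sketch does not wrap input angles.
SETS = {
    "left_rear":  (1.0, 200.0, 220.0, 260.0, 280.0),
    "left_front": (1.0, 280.0, 300.0, 328.0, 348.0),
    "front":      (1.0, 328.0, 348.0, 372.0, 392.0),
}

def fuzzify(angle):
    """Return the degree of membership of a crisp angle in every fuzzy set."""
    return {name: trapezoid(angle, *params) for name, params in SETS.items()}

degrees = fuzzify(335.0)
print(degrees)
```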

Consequently, the process of fuzzy inference can be started, which makes it possible 
to find the winning rules. The (exemplary) fuzzy rules have the following form: 

1. if front AND FRONT then no_shift 

2. if left_front AND LEFT_FRONT then no_shift 

3. if left_front AND FRONT then slight_shift 

4. if left AND LEFT then no_shift 

5. if left AND LEFT_FRONT then slight_shift 

where lowercase labels denote current directivity indices and uppercase labels 
denote the directivity of the same sound played back during the previous tests (in the 
absence of vision). 

It was assumed that the presence of vision shifts the sound localization towards the 
front direction only (not in the opposite direction) and that phantom sources cannot 
migrate from the left to the right hemisphere or vice versa. These assumptions have been 
justified in practice. The rules applying to the right front, lateral and rear directions 
are similar to the ones above. The AND function present in the rules is the "fuzzy and" 
[9]; it thus chooses the smallest of its arguments. The consequences no_shift, 
slight_shift, medium_shift and strong_shift are also fuzzy notions, so if it is necessary 
to change them to concrete (crisp) angle values, a defuzzification process should be 
performed based on the output prototype membership functions. 




Fig. 9. Fuzzification process of localization angle: (1) - left-rear, (2) - left-front, 
and (3) - front membership functions 



All rules are evaluated once the fuzzy inference is executed and finally the strongest 
rule is selected as the winning one. These are standard procedures related to fuzzy 
logic processing of data [9]. The winning rule demonstrates the existence and the 
intensity of the phantom sound source shifting due to the presence of vision. Since the 
fuzzy rules are readable and understandable for human operators of the system, 
this application provides a very robust method for studying complex phenomena 
related to the influence of vision coming from a frontal TV screen on the subjective 
localization of sound sources in surround space. The mentioned defuzzification procedure 
[9] also allows a fuzzy descriptor to be mapped to a crisp angle measure whenever 
it is necessary to estimate such a value. 
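
A minimal sketch of the rule evaluation described in this section is given below, with the "fuzzy and" implemented as the minimum and the winning rule taken as the one with the greatest strength. The rule list follows the five exemplary rules; the membership degrees for the test without vision are hypothetical, and the function name `winning_rule` is an assumption of this example.

```python
# (current descriptor, descriptor without vision, consequence)
RULES = [
    ("front", "FRONT", "no_shift"),
    ("left_front", "LEFT_FRONT", "no_shift"),
    ("left_front", "FRONT", "slight_shift"),
    ("left", "LEFT", "no_shift"),
    ("left", "LEFT_FRONT", "slight_shift"),
]

def winning_rule(current, previous):
    """Evaluate all rules with fuzzy AND = min and return the strongest one."""
    scored = []
    for cur, prev, outcome in RULES:
        strength = min(current.get(cur, 0.0), previous.get(prev, 0.0))
        scored.append((strength, cur, prev, outcome))
    return max(scored, key=lambda item: item[0])

current = {"left_front": 0.65, "front": 0.35}    # with vision (the 335 degree example)
previous = {"LEFT_FRONT": 0.8, "FRONT": 0.2}     # hypothetical degrees without vision

print(winning_rule(current, previous))
```

Descriptors missing from a dictionary are treated as having a membership degree of 0, so rules that refer to them cannot win.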
5 Conclusions 

Audio and video interact with each other. The mechanisms of this interaction are 
currently being investigated in two domains, perceptual and aesthetic, employing fuzzy 
logic in the analysis of the tested subjects' answers. The results of such 
experiments could yield recommendations for sound engineers producing surround 
movie soundtracks, digital video and multimedia. 
Abstract. The optimization of rough set based classification models 
with respect to a parameterized balance between a model's complexity 
and confidence is discussed. For this purpose, the notion of a 
parameterized approximate inconsistent decision reduct is used. Experimental 
extraction of the considered models from real life data is described. 



1 Introduction 

While reasoning about a domain specified by our needs, we usually rely on 
the information gathered by the analysis of a sample of objects. Rough 
set theory ([3]) assumes that a universe of known objects is the only source 
of knowledge which can be applied to construct models of reasoning about 
new cases. Reasoning can be stated, e.g., as a classification problem, concerning 
prediction of a decision attribute under information provided over conditional 
attributes. For this purpose, one stores data within decision tables, where each 
training case falls into one of the predefined decision classes. 

Classification of new objects is performed by analogy, e.g., by the usage of 
"if... then..." decision rules calculated over the universe of a given table. 
Theoretical studies related to the Minimum Description Length Principle (MDLP) 
(cf. [5]), as well as practical experiences, lead to the same conclusion: Optimal 
rule-based classification models should be extracted from data by tuning up a 
parameterized tradeoff between the overall confidence and complexity of the decision 
rule collections. 

Confidence of a rule-based model can be interpreted as the expected chance 
of correct classification of new cases. To express such a chance numerically, we 
need to set up a model for representing inexact conditions→decision 
dependencies. Then, we are able to evaluate the degree of decision information provided 
by each particular subset of conditional attributes, and to express the dynamics 
of this degree under attribute reduction. In the same way, one can interpret 
the complexity of a given collection as opposite to the expected chance of 
recognizing new cases by its decision rules. This leads to a rough set based version of 
MDLP, related to the fundamental concept of searching for approximate decision 
reducts: Given a rule-based decision model, any simplification which approximately 
preserves the expected chance of correct classification should be performed to 
increase the expected chance of the new case recognition. 

The above principle can be regarded as the starting point for the design of 
the process of rule-based classification model optimization. In the paper, we 
discuss an exemplary methodology of setting up the foregoing items, concerning the 
adjustment of thresholds, voting measures, etc. Accordingly, Section 2 includes 
the basics of rough set based classification techniques (cf. [4], [6]). Sections 3 and 
4 outline exemplary extensions of these methods by introducing the notion of an 
approximate reduct based on a normalized decision function (cf. [7]). In Sections 
5 and 6 we present the main contribution: the classification algorithm based on 
the family of parameterized decision functions. Section 7 contains experimental 
verification of the performance of the proposed classification framework. 

2 Decision rules and reducts 

In rough set theory a sample of data takes the form of an information system 
$\mathbb{A} = (U, A)$, where each attribute $a \in A$ is a function $a : U \to V_a$ into 
the set of all possible values of a. Reasoning about data can be stated as, e.g., a 
classification problem, where a distinguished decision is to be predicted under 
information over conditional attributes. In this case, we consider a triple 
$\mathbb{A} = (U, A, d)$, called a decision table, where, for the decision attribute 
$d \notin A$, values $v_d \in V_d$ correspond to mutually disjoint decision classes of objects. 

Definition 1. Let $\mathbb{A} = (U, A, d)$, where $A = (a_1, \ldots, a_{|A|})$, be given. For any 
$B \subseteq A$, $B = (a_{i_1}, \ldots, a_{i_{|B|}})$, the B-information function over U is defined by 

$$ Inf_B(u) = (a_{i_1}(u), \ldots, a_{i_{|B|}}(u)) \tag{1} $$

The B-indiscernibility relation is the equivalence relation defined by 

$$ IND_{\mathbb{A}}(B) = \{(u, u') \in U \times U : Inf_B(u) = Inf_B(u')\} \tag{2} $$

Each $u \in U$ induces a B-indiscernibility class of the form 

$$ [u]_B = \{u' \in U : (u, u') \in IND_{\mathbb{A}}(B)\} \tag{3} $$

which can be identified with vector $Inf_B(u)$. 

Indiscernibility enables us to express global dependencies as follows: 

Definition 2. Given $\mathbb{A} = (U, A, d)$, we say that $B \subseteq A$ defines d in $\mathbb{A}$ iff 

$$ IND_{\mathbb{A}}(B) \subseteq IND_{\mathbb{A}}(\{d\}) \tag{4} $$

or, equivalently, iff for any $u \in U$ the following u-oriented rule is valid in $\mathbb{A}$: 

$$ (B = Inf_B(u)) \Rightarrow (d = d(u)) \tag{5} $$

We say that $B \subseteq A$ is a decision reduct iff it defines d and none of its proper 
subsets does it. 
Given a collection of subsets $\mathcal{B} \subseteq \mathcal{P}(A)$ which define d, we can 
classify any new case $u_{new} \notin U$ by using the bunch of decision rules of the form (5). 
The only requirement is that $u_{new}$ must be comparable with U with respect to at least 
one (several) of $B \in \mathcal{B}$. To improve such understood recognition of new cases, we 
rely on (approximate) decision reducts of possibly low complexity, expressible 
in various terms (cf. [2], [8]). 
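
The notions of Definitions 1 and 2 can be illustrated with a short Python sketch. The representation of a decision table as a list of dictionaries, the toy data and the function names `inf`, `ind_classes` and `defines` are assumptions made for this example only.

```python
from collections import defaultdict

def inf(row, B):
    """B-information vector of one object (Eq. (1))."""
    return tuple(row[a] for a in B)

def ind_classes(table, B):
    """Group object indices into B-indiscernibility classes (Eqs. (2)-(3))."""
    classes = defaultdict(list)
    for i, row in enumerate(table):
        classes[inf(row, B)].append(i)
    return dict(classes)

def defines(table, B, d="d"):
    """True iff every B-indiscernibility class carries one decision value (Eq. (4))."""
    return all(len({table[i][d] for i in members}) == 1
               for members in ind_classes(table, B).values())

# Hypothetical toy decision table with conditions a1, a2 and decision d.
table = [
    {"a1": 0, "a2": 0, "d": "no"},
    {"a1": 0, "a2": 1, "d": "yes"},
    {"a1": 1, "a2": 1, "d": "yes"},
    {"a1": 1, "a2": 0, "d": "no"},
]

print(ind_classes(table, ["a1"]))
print(defines(table, ["a2"]), defines(table, ["a1"]))
```

In this toy table the single attribute a2 defines d while a1 does not, so {a2} is a decision reduct in the sense of Definition 2.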



3 Normalized decision functions 

In a consistent decision table $\mathbb{A} = (U, A, d)$, where each indiscernibility class 
of $IND_{\mathbb{A}}(A)$ has one decision value, decision rules lead to deterministic 
classification within the universe U. In case of inconsistent decision tables 
(non-deterministic dependencies among attributes), we should specify the way of 
dealing with uncertainty. 

Definition 3. Let $\mathbb{A} = (U, A, d)$, a linear ordering $V_d = \langle v_1, \ldots, v_r \rangle$, 
$r = |V_d|$, and $B \subseteq A$ be given. By a B-rough membership distribution we call the 
function $\vec{\mu}_{d/B} : U \to \Delta_{r-1}$¹ defined by 

$$ \vec{\mu}_{d/B}(u) = \langle \mu_{d=1/B}(u), \ldots, \mu_{d=r/B}(u) \rangle \tag{6} $$

where, for $k = 1, \ldots, r$, $\mu_{d=k/B}(u) = |\{u' \in [u]_B : d(u') = v_k\}| / |[u]_B|$ is the 
rough membership function (cf. [f]) labeling $u \in U$ with the degree of hitting the 
k-th decision class with its B-indiscernibility class $[u]_B$. 

Distributions of the form (6) seem to express the most accurate knowledge about 
dependencies of the decision on conditions (cf. [4], [7], [9]). Thus, it should be 
possible to model various B-based reasoning strategies as functions acting over 
$\vec{\mu}_{d/B}$, "forgetting" a part of frequency information which is redundant with 
respect to a given approach. 
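
The rough membership distribution of Eq. (6) amounts to counting decision values inside an indiscernibility class. The sketch below assumes a list-of-dictionaries table and a hypothetical function name `rough_membership`; the data are invented for illustration.

```python
from collections import Counter

def rough_membership(table, B, u, values, d="d"):
    """Distribution of decision values inside the B-indiscernibility class of object u.

    `values` is the linear ordering of decision values v_1, ..., v_r.
    """
    key = tuple(table[u][a] for a in B)
    cls = [row[d] for row in table if tuple(row[a] for a in B) == key]
    counts = Counter(cls)
    return [counts[v] / len(cls) for v in values]

# Hypothetical inconsistent table: objects 0-2 agree on a1 but not on d.
table = [
    {"a1": 0, "d": "no"},
    {"a1": 0, "d": "yes"},
    {"a1": 0, "d": "yes"},
    {"a1": 1, "d": "no"},
]
values = ["no", "yes"]

print(rough_membership(table, ["a1"], 0, values))
print(rough_membership(table, ["a1"], 3, values))
```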

Definition 4. ([1]) Let $\mathbb{A} = (U, A, d)$, $B \subseteq A$ and $\phi : \Delta_{r-1} \to \Delta_{r-1}$, 
$r = |V_d|$, be given. We say that $\phi$ is a normalized decision function (ND-function) iff 
it satisfies the following logical and monotonic consistency assumptions: 

$$ \forall_k \, (s[k] = 0 \Rightarrow \phi(s)[k] = 0) \;\wedge\; \forall_{k,l} \, (s[k] \le s[l] \Rightarrow \phi(s)[k] \le \phi(s)[l]) \tag{7} $$

Function $\phi_{d/B}(u) = \phi(\vec{\mu}_{d/B}(u))$ is called a normalized $\phi_{d/B}$-decision function. 

According to (7), a positive weight cannot be attached to a non-supported event 
and the relative chances provided by the reasoning strategy cannot contradict 
those derived directly from an information source. 

¹ For any $r \in \mathbb{N}$, $\Delta_{r-1}$ denotes the (r-1)-dimensional simplex of real valued vectors 
$s = (s[1], \ldots, s[r])$ with non-negative coordinates, such that $\sum_{k=1}^{r} s[k] = 1$. 
Example 1. Consider ND-functions $\partial, m : \Delta_{r-1} \to \Delta_{r-1}$ defined by 

$$
\partial(s)[k] = \begin{cases} |\{l : s[l] > 0\}|^{-1} & \text{for } s[k] > 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
m(s)[k] = \begin{cases} |\{l : s[l] = \max(s)\}|^{-1} & \text{for } s[k] = \max(s) \\ 0 & \text{otherwise} \end{cases}
\tag{8}
$$

where $\max(s) = \max_k s[k]$. One can see that by combining $\vec{\mu}_{d/B}(u) \in \Delta_{r-1}$ 
with $\partial$ and m we obtain the uniform distributions spanned over the subsets: (1) 
of decision values induced by the generalized decision (cf. [6]); (2) of 
decision values taking the maximum over the coordinates of $\vec{\mu}_{d/B}(u)$. 
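
The two ND-functions of Example 1 can be written down directly. The Python names `nd_support` (for ∂) and `nd_max` (for m) are hypothetical labels chosen for this sketch.

```python
def nd_support(s):
    """The function written as 'partial' in Eq. (8): uniform over coordinates with s[k] > 0."""
    n = sum(1 for v in s if v > 0)
    return [1.0 / n if v > 0 else 0.0 for v in s]

def nd_max(s):
    """The function m in Eq. (8): uniform over coordinates attaining max(s)."""
    top = max(s)
    n = sum(1 for v in s if v == top)
    return [1.0 / n if v == top else 0.0 for v in s]

s = [0.25, 0.75, 0.0]
print(nd_support(s))
print(nd_max(s))
```

Both functions keep zero coordinates at zero and never reverse the order of two coordinates, which is what assumption (7) requires.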

4 Normalized decision measures 

In real life applications, the search for attributes which approximately preserve 
φ-decision distributions seems to be promising. We are likely to understand an 
approximate φ-decision reduct as a minimal subset of conditions which almost 
preserves information about decision in terms of a given ND-function. 

Definition 5. Let $\mathbb{A} = (U, A, d)$ and $\phi : \Delta_{r-1} \to \Delta_{r-1}$, $r = |V_d|$, be given. 
The normalized φ-decision measure $E_{\phi/\mathbb{A}} : \mathcal{P}(A) \to [0, 1]$ is defined by² 

$$ E_{\phi/\mathbb{A}}(B) = \frac{1}{|U|} \sum_{u \in U} \langle \vec{\mu}_{d/B}(u) \,|\, \phi_{d/B}(u) \rangle \tag{9} $$

The value of (9) equals the average probability that objects $u \in U$ will be 
correctly classified by a random $\phi_{d/B}$-weighted choice among decision classes 
([7]). Thus, φ-decision measures enable us to evaluate subsets numerically with 
respect to their capabilities of φ-defining decision. 

Definition 6. Let $\mathbb{A} = (U, A, d)$, $\varepsilon \in [0, 1)$ and $\phi : \Delta_{r-1} \to \Delta_{r-1}$, 
$r = |V_d|$, be given. We say that subset $B \subseteq A$ ε-approximately φ-defines d iff 

$$ E_{\phi/\mathbb{A}}(B) \ge (1 - \varepsilon) E_{\phi/\mathbb{A}}(A) \tag{10} $$

We say that $B \subseteq A$ is an ε-approximate φ-decision reduct iff it φ-defines d 
ε-approximately and none of its proper subsets does it. 
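
The following sketch shows one way to compute the measure of Eq. (9) and to check condition (10). The table layout, the toy data and the function names are assumptions of this example; as the ND-function it uses the x-decision function with x = 1, which leaves a distribution unchanged.

```python
from collections import Counter, defaultdict

def distributions(table, B, values, d="d"):
    """Rough membership distribution of every object with respect to B."""
    classes = defaultdict(list)
    for row in table:
        classes[tuple(row[a] for a in B)].append(row[d])
    out = []
    for row in table:
        cls = classes[tuple(row[a] for a in B)]
        counts = Counter(cls)
        out.append([counts[v] / len(cls) for v in values])
    return out

def x_decision(s, x):
    """Raise each coordinate to the power x and renormalize (x > 0)."""
    powered = [v ** x for v in s]
    total = sum(powered)
    return [p / total for p in powered]

def measure(table, B, values, phi):
    """Eq. (9): average inner product of mu and phi(mu) over all objects."""
    dists = distributions(table, B, values)
    return sum(sum(p * q for p, q in zip(s, phi(s))) for s in dists) / len(table)

def approx_defines(table, B, A, values, phi, eps):
    """Eq. (10): B keeps at least (1 - eps) of the measure obtained for all of A."""
    return measure(table, B, values, phi) >= (1 - eps) * measure(table, A, values, phi)

# Hypothetical toy table; the full attribute set is consistent, a1 alone is not.
table = [
    {"a1": 0, "a2": 0, "d": "no"},
    {"a1": 0, "a2": 1, "d": "yes"},
    {"a1": 1, "a2": 0, "d": "yes"},
    {"a1": 1, "a2": 1, "d": "yes"},
]
values = ["no", "yes"]
phi = lambda s: x_decision(s, 1.0)

print(measure(table, ["a1", "a2"], values, phi), measure(table, ["a1"], values, phi))
print(approx_defines(table, ["a1"], ["a1", "a2"], values, phi, eps=0.3))
```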

Two parameters can be tuned up while searching for optimal conditions for 
classification: (1) the ND-function φ responsible for the way of understanding inexact 
conditions→decision dependencies, and (2) the degree ε ∈ [0, 1) up to which we 
are likely to neglect the decrease of φ-decision information provided by smaller 
subsets B ⊆ A with respect to the whole of A. In case of the first parameter, it 
is easier to handle a numeric factor responsible for adjusting a specific function: 

² By "⟨·|·⟩" we mean the inner product of two distribution vectors. 
Definition 7. Let $\mathbb{A} = (U, A, d)$ be given. For any $x \in (0, +\infty)$, we define the 
normalized x-decision function by putting, for any $s \in \Delta_{r-1}$, $r = |V_d|$, 

$$ x(s)[k] = \frac{(s[k])^x}{\sum_{l=1}^{r} (s[l])^x} \tag{11} $$

Proposition 1. All x-decision functions satisfy (7). For any $s \in \Delta_{r-1}$, 

$$ \lim_{x \to 0^+} x(s) = \partial(s) \;\wedge\; \lim_{x \to +\infty} x(s) = m(s) \tag{12} $$

For any $\mathbb{A} = (U, A, d)$, $B \subseteq A$, $0 < x_1 < x_2 < +\infty$, we have 

$$ E_{\partial/\mathbb{A}}(B) \le E_{x_1/\mathbb{A}}(B) \le E_{x_2/\mathbb{A}}(B) \le E_{m/\mathbb{A}}(B) \tag{13} $$

where equalities hold iff $E_{\partial/\mathbb{A}}(B) = E_{m/\mathbb{A}}(B)$. 

One can see that the obtained x-parameterized family covers densely enough all 
possible ways of performance of ND-functions over training data. 
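
The limiting behaviour stated in Eq. (12) can be checked numerically with a few lines of Python. The function name `x_decision` and the sample vector are hypothetical; very small and very large values of x stand in for the two limits.

```python
def x_decision(s, x):
    """Eq. (11): raise each coordinate to the power x and renormalize (x > 0)."""
    powered = [v ** x for v in s]
    total = sum(powered)
    return [p / total for p in powered]

s = [0.2, 0.8, 0.0]
near_zero = x_decision(s, 1e-6)    # close to the uniform distribution over the support
identity = x_decision(s, 1.0)      # x = 1 leaves the distribution unchanged
large = x_decision(s, 200.0)       # concentrates on the maximal coordinate

print(near_zero, identity, large)
```

Because a zero coordinate stays zero for every x > 0, the first part of assumption (7) is visible directly in the third coordinate of each result.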

5 Optimization of approximate reducts 

In Section 1 we suggested relating the overall confidence of a rule-based decision 
model to the expected chance of correct classification of new cases. Above, we 
argued that the quantities of normalized decision measures can be interpreted in 
that way. Analogously, let us now propose an exemplary measure of the expected 
chance of recognizing new cases by decision rules generated by a given $B \subseteq A$. 

Definition 8. Let $\mathbb{A} = (U, A, d)$ be given. The normalized coverage measure 
$cov_{\mathbb{A}} : \mathcal{P}(A) \to [0, 1]$ is defined by 

$$ cov_{\mathbb{A}}(B) = \frac{1}{|U|} \sum_{u \in U} \mu_B(u) \tag{14} $$

where $\mu_B(u) = |[u]_B| / |U|$ is the frequency of occurrence of vector $Inf_B(u)$ in $\mathbb{A}$. 

One should realize that this is just one of the possibilities of estimating the 
recognition probability (cf. [8]). Still, we would like to proceed with this measure, 
because it turns out to be flexible enough with respect to applications. 
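
The coverage measure of Eq. (14) can be computed by counting how often each information vector occurs. The sketch assumes a list-of-dictionaries table; the data and the function name `coverage` are hypothetical.

```python
from collections import Counter

def coverage(table, B):
    """Eq. (14): average relative size of the B-indiscernibility class over all objects."""
    n = len(table)
    sizes = Counter(tuple(row[a] for a in B) for row in table)
    return sum(sizes[tuple(row[a] for a in B)] / n for row in table) / n

# Hypothetical toy table (the decision column is not needed for coverage).
table = [
    {"a1": 0, "a2": 0},
    {"a1": 0, "a2": 1},
    {"a1": 1, "a2": 0},
    {"a1": 1, "a2": 1},
]

print(coverage(table, []), coverage(table, ["a1"]), coverage(table, ["a1", "a2"]))
```

Removing attributes merges indiscernibility classes, so the coverage never decreases when a subset is reduced; this is why reduction is used to increase the chance of recognizing new cases.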

The exemplary procedure presented below searches for an ε-approximate 
x-decision reduct $B \subseteq A$ with the highest possible coverage $cov_{\mathbb{A}}(B)$, by following 
a randomly generated permutation $\sigma \in \Sigma_{|A|}$ over conditional attributes³. First, 
starting with $B = \emptyset$, we add the foregoing attributes until B begins to 
x-define d in an ε-approximate way. The second part reflects the optimization principle 
formulated at the beginning of this paper: We try to reduce a model until it 
ε-approximately preserves x-decision information⁴, to increase $cov_{\mathbb{A}}$ as the measure 
of the predicted average chance of the new case recognition. 

³ We denote by $\Sigma_n$ the set of all n-element permutations, i.e. "1-1" functions 
$\sigma : \{1, \ldots, n\} \to \{1, \ldots, n\}$. We use $\sigma \in \Sigma_{|A|}$ to re-order $A = (a_1, \ldots, a_{|A|})$. 
⁴ We base the search for $cov_{\mathbb{A}}$-maximal approximate decision reducts on random 
heuristics, because this optimization problem is NP-hard (cf. [7]). 
Algorithm: Approximate reducts generation 

Input: Decision table A = (U, A, d), permutation σ of A, ε ∈ [0, 1), x ∈ (0, +∞). 
Output: ε-approximate x-decision reduct. 

1. B := ∅; cov := cov_A(B); app := E_{x/A}(B); i := 1 
2. while app < (1 - ε) E_{x/A}(A) begin 
3.     B := B ∪ {a_σ(i)} 
4.     (cov, app) := Test(B, x) 
5.     i := i + 1 
6. end while 
7. maxcov := 0 
8. do 
9.     stop := 1 
10.    for each a_j ∈ B 
11.        cov := cov_A(B \ {a_j}) 
12.        app := E_{x/A}(B \ {a_j}) 
13.        if app ≥ (1 - ε) E_{x/A}(A) and cov > maxcov then begin 
14.            maxcov := cov 
15.            maxj := j 
16.            stop := 0 
17.        end if 
18.    end for 
19.    if stop = 0 then B := B \ {a_maxj} 
20. while stop = 0 
21. return B 
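
The two phases of the algorithm can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: the table representation, the toy data and all function names are assumptions, and the removal test keeps an attribute subset only when it still meets the threshold (1 - ε)E(A).

```python
from collections import Counter, defaultdict

def _key(row, B):
    return tuple(row[a] for a in B)

def measure(table, B, values, x, d="d"):
    """Normalized x-decision measure: Eq. (9) with phi given by Eq. (11)."""
    classes = defaultdict(list)
    for row in table:
        classes[_key(row, B)].append(row[d])
    total = 0.0
    for row in table:
        cls = classes[_key(row, B)]
        counts = Counter(cls)
        mu = [counts[v] / len(cls) for v in values]
        powered = [p ** x for p in mu]
        norm = sum(powered)
        total += sum(p * q / norm for p, q in zip(mu, powered))
    return total / len(table)

def coverage(table, B):
    """Normalized coverage measure of Eq. (14)."""
    n = len(table)
    sizes = Counter(_key(row, B) for row in table)
    return sum(sizes[_key(row, B)] / n for row in table) / n

def approximate_reduct(table, attrs, sigma, eps, x, values):
    """attrs: conditional attributes; sigma: permutation of range(len(attrs))."""
    threshold = (1 - eps) * measure(table, attrs, values, x)
    B, i = [], 0
    # Phase 1: add attributes in the order given by sigma until the threshold is met.
    while measure(table, B, values, x) < threshold:
        B.append(attrs[sigma[i]])
        i += 1
    # Phase 2: repeatedly drop the attribute whose removal still meets the
    # threshold and gives the highest coverage seen so far.
    maxcov = 0.0
    while True:
        best = None
        for a in B:
            rest = [b for b in B if b != a]
            cov = coverage(table, rest)
            if measure(table, rest, values, x) >= threshold and cov > maxcov:
                maxcov, best = cov, a
        if best is None:
            return B
        B.remove(best)

# Hypothetical toy table in which the decision depends on a2 only.
table = [
    {"a1": 0, "a2": 0, "a3": 0, "d": "no"},
    {"a1": 0, "a2": 1, "a3": 1, "d": "yes"},
    {"a1": 1, "a2": 0, "a3": 1, "d": "no"},
    {"a1": 1, "a2": 1, "a3": 0, "d": "yes"},
]
values = ["no", "yes"]
print(approximate_reduct(table, ["a1", "a2", "a3"], [0, 1, 2], 0.0, 1.0, values))
```

With ε = 0 the first phase collects a1 and a2, and the second phase drops a1 because a2 alone still defines the decision and has higher coverage. With a large ε the empty set may already satisfy the threshold, in which case it is returned unchanged.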



6 Optimization of the approximate reduct collections 

A rule-based decision model should correspond to more than one subset of 
conditions. Thus, we construct systems composed of collections of classifying agents 
based on different ε-approximate x-decision reducts, obtained as $cov_{\mathbb{A}}$-optimal 
while following a number of randomly generated permutations. Since it is not 
known how to adjust the best configuration of ε ∈ [0, 1) and x ∈ (0, +∞), we 
consider collections of (ε, x)-parameterized agents initialized randomly, to simulate 
a kind of adaptation process searching through the space [0, 1) × (0, +∞). 

Given $\mathcal{B} \subseteq \mathcal{P}(A) \times (0, +\infty)$ as the collection of obtained parameterized 
reducts, one also needs to specify the way of voting between particular agents. In 
general, negotiations concerning prediction of the decision value for a given u 
lead to the choice of $v_k \in V_d$ with a maximal value of a voting measure, 
calculated from u-oriented x-decision rules induced by particular elements of $\mathcal{B}$. Below, 
one can find examples of such voting measures: 

VOTEl{B) — Y^(^b,x)EB: pb(u)> 0 l^i^i'^)^d-k/B{u) 

VOTE2{B) = E(B,x)eB-.iiB(u)>o^^d=k/B{u)xd=k/B{u) (15) 

VOTE‘6{B) = E{B,x)eB: i.iB{u)>0^d=k/Bi’d') 

Another problem concerns the fact that although all agents can be used to 
classify new objects, a subset of them often performs much better (cf. Fig. 1). 
We apply a specific genetic algorithm to search for optimal sub-collections of 
agents. In particular, it results in an indirect optimization process concerned 
with the ranges of (ε, x)-parameters. 

Fig. 1. Classification quality (vertical axis) and the number of agents in a team (horizontal axis): 
examples obtained for "DNA splices" and "primary tumor" data sets. 



7 Experimental results 

The methodology described in the paper was implemented and tested on several 
data sets obtained from [1]. Results presented in Table 1 concern two of them: 
(1) "DNA splices" - 2000 objects, 20 symbolic attributes, 3 decision classes; (2) 
"Primary tumor" - 339 objects, 17 symbolic attributes, 22 decision classes. 



Voting    Approx    Mode    Result 
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- 


- 


82,06 


2 


- 
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86.96 


3 
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93,75 
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3 
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87,05 
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- 


43.28 
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0.3 


- 


43.48 
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0.2 
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40.80 
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0.1 
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39.53 
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0.4 




44.46 
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0.4 


exp(-l) 
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0.4 


exp(O) 


44,70 
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0.4 


exp(l) 


42.85 


2 


0.4 


sxp(2) 


42.79 



Table 1. Experimental results for "DNA splices" and "primary tumor" data sets, obtained by 
voting among optimized collections of approximate x-decision reducts: (1) The choice of a measure 
from (15) corresponds to the Voting column; (2) Quantities of ε and x are chosen randomly from 
small intervals around values in the Approx and Mode columns, where symbol "-" means the uniform 
random choice from a wider interval; (3) Average percent of tested objects classified correctly for 
particular settings is presented in the Result column. Cross-validation (CV-5) was used in case of 
"primary tumor" data. 



Experiments presented in Table 1 were performed with various settings: 

1. Optimal voting measure was selected by setting other parameters randomly; 

2. For the best voting method, several values of ε ∈ [0, 0.5) were tested; 

3. For the best voting method and ε-thresholds selected from the small interval 
around the best value found previously, different values of parameter x ∈ 
[exp(-2), exp(2)] were tested. 

It is worth noting that the best results obtained in our experiments are close to 
the best results ever found. 
8 Conclusions 

A parameterized tradeoff between model complexity and accuracy was discussed. 
To handle it in a flexible way, the notion of a parameterized ε-approximate 
x-decision reduct was used. Main issues concerning implementation of the 
classification algorithm based on the described methodology were outlined. 

Experiments performed on two "benchmark" data sets show that our technique 
is relatively fast and very efficient. It is worth noting that the best results for 
these sets were obtained using significantly different voting and (x, ε)-settings. This 
suggests considering adaptive mechanisms for tuning these parameters 
in the near future. 
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An Application of Rough Sets and Haar 
Wavelets to Face Recognition 
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Abstract. The paper presents an application of data mining methods 
for face recognition. The proposed methods are based on wavelets, prin- 
cipal components analysis, rough sets and neural networks. The features 
from the face images have been extracted based on the Haar wavelets 
followed by the principal component analysis (PCA), and rough sets 
processing. We have applied the rough sets methods for selection of facial
features based on the minimum concept description paradigm. The
recognition of facial images, for the reduced features, has been carried
out using an error backpropagation neural network.

Keywords: rough sets, face recognition, feature selection, wavelets 



1 Introduction 

The selection of the best feature sets representing recognized objects is an
important step in the classifier design. The feature selection process is generally
goal, data, and classifier type dependent [1],[2]. We present an application of a
feedforward error backpropagation neural network to face image classification,
using Haar wavelets for feature extraction, and Principal Component Analysis
followed by the rough sets method for feature reduction and selection.

We emphasize the rough sets methodology [1],[2],[3] and a minimum concept
description paradigm for selection of the final face feature vector.

The paper begins with a brief description of the Haar wavelet transform for
extraction of facial features. In the following sections we briefly present
principal component analysis and the rough sets theory as a foundation of methods for
feature selection/reduction. Finally, the description of numerical experiments of
face recognition, using the neural network classifier and the presented methods
of feature extraction, projection and selection, concludes the paper.



2 The 2D Haar Wavelets Transform 

The wavelet transform is a method of approximating a given function $f(t) \in L^2(R)$
using another function $\psi(t)$ (a wavelet function) representing a scalable
approximation curve localized on a definite time (or space) interval. Two-dimensional
parametrization, with a dilation parameter $a$ and a translation parameter
$b$, yields a family of continuous wavelets

$$\psi_{a,b}(t) = |a|^{-1/2}\,\psi\!\left(\frac{t-b}{a}\right), \quad a, b \in R,\ a > 0 \tag{1}$$

where $a$ and $b$ may vary over $R$. For a given function $f(t)$ the continuous wavelet
transform is defined as

$$WT_f(a,b) = |a|^{-1/2} \int_{-\infty}^{\infty} f(t)\,\psi\!\left(\frac{t-b}{a}\right) dt \tag{2}$$

The discrete wavelets are obtained by sampling parameters $a$ and $b$ as $a = a_0^j$
and $b = k a b_0 = k a_0^j b_0$ (where $j, k \in Z$, $j, k = \pm 1, \pm 2, \ldots$)

$$\psi_{j,k}(t) = a_0^{-j/2}\,\psi(a_0^{-j} t - k b_0), \quad j, k \in Z \tag{3}$$

where $\psi_{j,k}(t)$ constitutes a basis for $L^2(R)$.

The discrete wavelet transform is defined for sampled parameters by the equation

$$DWT_f(j,k) = a_0^{-j/2} \int_{-\infty}^{\infty} f(t)\,\psi(a_0^{-j} t - k b_0)\,dt \tag{4}$$

For $a_0 = 2$ ($a = 2^j$) and $b_0 = 1$ ($b = 2^j k$), the functions $\psi_{j,k}(t)$ form an
orthogonal wavelet base for $L^2(R)$: $\psi_{j,k}(t) = 2^{-j/2}\,\psi(2^{-j}t - k)$. Multi-resolution
analysis of a function $f(t)$ can be realized using a dilated and translated scaling
function $\phi_{j,k}(t) = 2^{-j/2}\,\phi(2^{-j}t - k)$. For discrete parameters $a$ and $b$, a
discrete wavelet transformation decomposes a function, determined by $N$ discrete
samples $\{f(1), f(2), \ldots, f(N)\}$, into an expansion of two functions: a scaling
function $\phi(t)$ and a wavelet function $\psi(t)$. The basis set for a scaling function
(non-normalized) is

$$\phi_{L,k}(t) = \phi(2^{L}t - k), \quad k = 1, 2, \ldots, K_L, \quad K_L = N 2^{-L} \tag{5}$$

where $L$ is an expansion level, and for the wavelet function

$$\psi_{j,k}(t) = \psi(2^{j}t - k), \quad j = 1, 2, \ldots, L; \quad k = 1, 2, \ldots, K; \quad K = N 2^{-j} \tag{6}$$

where the level of expansion $L$ satisfies $0 < L \le \log_2(N)$.

An $L$-level discrete wavelet transform of function $f(t)$ described by $N$ samples
contains:

1. a set of parameters $\{a_{L,k}\}$ defined by the inner products of $f(t)$ with $N 2^{-L}$
translations of the scaling function $\phi(t)$ at a single width

$$\{a_{L,k}\} = \{\langle f(t), \phi_{L,k}(t)\rangle,\ k = 1, 2, \ldots, K_L,\ K_L = N 2^{-L}\} \tag{7}$$

2. a set of parameters $\{b_{j,k}\}$ defined by the inner products of $f(t)$ with $N 2^{-j}$
translations of the wavelet function $\psi(t)$ at $L$ different widths

$$\{b_{j,k}\} = \{\langle f(t), \psi_{j,k}(t)\rangle,\ j = 1, 2, \ldots, L;\ k = 1, 2, \ldots, K;\ K = N 2^{-j}\}$$



One of the simplest wavelets, the Haar wavelet, is defined as follows

$$\psi(t) = \begin{cases} 1, & \text{if } t \in [0, 0.5) \\ -1, & \text{if } t \in [0.5, 1) \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

The dilations and translations of the Haar wavelet function form an orthogonal
wavelet base for $L^2(R)$. The wavelets generated from the mother Haar wavelet are defined as

$$\psi_{j,k}(t) = \psi(2^{j}t - k), \quad j, k \in Z \tag{9}$$

The Haar scaling function is the unit-width function

$$\phi(t) = \begin{cases} 1, & \text{if } 0 \le t < 1 \\ 0, & \text{otherwise} \end{cases} \tag{10}$$



For a function $f(t) \in L^2(R)$, the discrete wavelet expansion of $f(t)$ is represented
as

$$f(t) = a_{0,0}\,\phi_{0,0}(t) + \sum_{j=1}^{d} \sum_{k=0}^{2^{j}-1} b_{j,k}\,\psi_{j,k}(t) \tag{11}$$

where $\phi_{0,0}(t)$ is a scale function on the interval $[0, 1)$ and $\{\psi_{j,k}(t)\}$ is the set of wavelets
with different resolution.

The two-dimensional wavelet transform can be realized by successively
applying the one-dimensional wavelet transform to data in every dimension. The
two-variable wavelet function can be defined as a product of two one-dimensional
mother (generating) wavelets $\psi(t_1)$ and $\psi(t_2)$

$$\psi(t_1, t_2) = \psi(t_1)\,\psi(t_2) \tag{12}$$

With dilation and translation parameters the two-dimensional wavelet function
is defined as

$$\psi_{(a_1,a_2),(b_1,b_2)}(t_1, t_2) = \psi_{(a_1,b_1)}(t_1)\,\psi_{(a_2,b_2)}(t_2) \tag{13}$$

where $\psi_{(a_i,b_i)}(t_i) = |a_i|^{-1/2}\,\psi\!\left(\frac{t_i - b_i}{a_i}\right)$, $i = 1, 2$. Assuming that $a_1 = a_2 = a$, the two-dimensional
wavelet expansion of a two-variable function $f(t_1, t_2)$ can be expressed as

$$WT_f(a, b_1, b_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(t_1, t_2)\,\psi_{(a,a),(b_1,b_2)}(t_1, t_2)\,dt_1\,dt_2 \tag{14}$$

For a two-dimensional grid of $2^n \times 2^n$ values (for example image pixels) and
for discrete parameters $a_1 = 2^{n-j_1}$, $a_2 = 2^{n-j_2}$, $b_1 = 2^{n-j_1}k_1$, $b_2 = 2^{n-j_2}k_2$,
with integer values for $j_1$, $j_2$, $k_1$ and $k_2$, the 2D discrete wavelet function can be
defined

$$\psi_{j_1,j_2,k_1,k_2}(t_1, t_2) = 2^{\frac{j_1+j_2}{2}-n}\,\psi(2^{j_1-n}t_1 - k_1)\,\psi(2^{j_2-n}t_2 - k_2) \tag{15}$$

where $j_1, j_2$ and $k_1, k_2$ are the dilation and the translation coefficients for each
variable. Additionally, defining the scaling function

$$\phi_{j_1,j_2,k_1,k_2}(t_1, t_2) = 2^{\frac{j_1+j_2}{2}-n}\,\phi(2^{j_1-n}t_1 - k_1)\,\phi(2^{j_2-n}t_2 - k_2) \tag{16}$$

allows one to define a complete basis to reconstruct a discrete function $f(t_1, t_2)$ (for
example a discrete image):

$$\phi_{0,0,0,0}(t_1, t_2) = 2^{-n}\,\phi(2^{-n}t_1)\,\phi(2^{-n}t_2) \tag{17}$$

together with mixed products of the form $2^{\frac{j_1}{2}-n}\,\psi(2^{j_1-n}t_1 - k_1)\,\phi(2^{-n}t_2)$.

The 2D discrete wavelet coefficients are defined as

$$2DWT_f(j_1, j_2, k_1, k_2) = \iint f(t_1, t_2)\,\xi_{j_1,j_2,k_1,k_2}(t_1, t_2)\,dt_1\,dt_2 \tag{18}$$

where $\xi_{j_1,j_2,k_1,k_2}$ denotes any of the previously defined orthonormal bases. These
coefficients can be arranged as the Haar wavelets coefficient matrix $P$.



2.1 Pattern forming based on the Haar wavelet transform 

An image pattern can be formed with coefficients of one or a few levels of the 2D
discrete wavelet transform of an image, represented by a rectangular array
of gray level pixels. We have used the first level of the 2D discrete Haar wavelet
transform of a centered $r \times r = 2^6 \times 2^6$ ($r = 2^n = 64$, $n = 6$) subimage of a face.

The $r \times r$ element original first level Haar wavelets matrix $P$ can be used to form
an $n_{Haar} = r \times r$ element Haar pattern as a concatenation of the matrix rows
$x_{Haar} = [p_1, p_2, \ldots, p_r]^T$, where $p_i$ is the $i$th row of the first level Haar
wavelets transform coefficient matrix $P$.
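
The pattern-forming step above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the orthonormal scaling factor 1/sqrt(2), the "averages first, differences second" coefficient layout, and the function names `haar_step_1d`, `haar2d_level1` and `haar_pattern` are choices made for this sketch and are not taken from the paper.

```python
import numpy as np

def haar_step_1d(a):
    """One Haar level along the last axis: [pairwise averages | pairwise differences].
    Assumes an even number of samples and orthonormal 1/sqrt(2) scaling."""
    even, odd = a[..., 0::2], a[..., 1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return np.concatenate([approx, detail], axis=-1)

def haar2d_level1(image):
    """First-level 2D Haar transform: 1D step on every row, then on every column."""
    img = np.asarray(image, dtype=float)
    rows_done = haar_step_1d(img)
    return haar_step_1d(rows_done.T).T

def haar_pattern(P):
    """Concatenate the rows of the coefficient matrix P into one pattern vector."""
    return np.asarray(P).reshape(-1)

# Usage: a 64 x 64 sub-image gives a 64 x 64 coefficient matrix and a 4096-element pattern.
subimage = np.arange(64 * 64, dtype=float).reshape(64, 64)
P = haar2d_level1(subimage)
x_haar = haar_pattern(P)
```

With this layout the low-frequency coefficients occupy the leading half of each row, so dropping trailing row elements removes detail coefficients first.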

Despite the expressive power of the Haar wavelet transformation it is difficult
to say in advance how powerful the Haar wavelets-based features could be for a
classification of face images.

Experiments have shown that the Haar pattern can be heuristically reduced by
removing trailing elements of each row of the original 2D Haar wavelets matrix
$P$. The heuristically reduced Haar wavelets-based pattern is formed with reduced
rows of the Haar wavelets coefficient matrix $P_r = [p_{r,1}, p_{r,2}, \ldots, p_{r,r}]^T$, where $p_{r,i}$ is
the $i$th row of the Haar wavelets matrix $P$ reduced by columns. The reduction
number $r_r$ is chosen by assuring that the $r - r_r$ leading elements in all rows of the
matrix $P_r$ have values below a heuristically selected threshold $\epsilon_{Haar}$. This can
result in $n_{Haar,r} = r \times (r - r_r)$ element reduced Haar wavelets patterns $x_{Haar,r}$.
In the next sections we discuss techniques of finding a reduced set of face image
features.
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3 Principal Component Analysis 

We have applied Principal Component Analysis (PCA) [1] for the orthonormal
projection (and reduction) of reduced Haar wavelets patterns $x_{Haar,r}$ of facial
images. Let us assume that a limited size sample of $N$ random $n_{Haar,r}$-dimensional
patterns $x_{Haar,r} \in R^{n_{Haar,r}}$, representing features extracted by the Haar
decomposition of face image matrices, has been gathered as an unlabeled training data
set $T_{Haar,r} = \{x^1_{Haar,r}, x^2_{Haar,r}, \ldots, x^N_{Haar,r}\}$, represented as an $N \times n$ data pattern
matrix $X = [x^1_{Haar,r}, x^2_{Haar,r}, \ldots, x^N_{Haar,r}]^T$. The optimal linear transformation
of reduced Haar patterns, $y = W x_{Haar,r}$, into an $m = n_{Haar,r,pca,r}$ element reduced
PCA pattern $y = x_{Haar,r,pca,r}$ is provided using the $m \times n$ optimal Karhunen-Loeve
transformation matrix $W = W_{KLT} = [e^1, e^2, \ldots, e^m]^T$ composed of
$m$ rows being the first $m$ orthonormal eigenvectors (corresponding to eigenvalues
arranged in decreasing order) of the covariance matrix $R_x$ of the original data $X$.
The optimal matrix $W$ transforms the original $n_{Haar,r}$-dimensional patterns
$x_{Haar,r}$ into $m = n_{Haar,r,pca,r}$-dimensional ($m \le n_{Haar,r}$) reduced PCA feature
patterns $y = x_{Haar,r,pca,r}$: $Y = (W X^T)^T = X W^T$, minimizing the mean
least square reconstruction error. The open question remains which principal
components to select as the best for a given processing goal. We discuss in the
next section an application of rough sets for feature selection of reduced PCA
patterns.
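
A minimal NumPy sketch of the Karhunen-Loeve projection described above follows. The function names `klt_matrix` and `project` are hypothetical, and the sketch applies $Y = XW^T$ to the patterns exactly as the formula is written, without subtracting the sample mean first; whether the authors centered the data is not stated in the text.

```python
import numpy as np

def klt_matrix(X, m):
    """Return the m x n matrix W whose rows are the m leading orthonormal
    eigenvectors of the covariance matrix of the N x n pattern matrix X."""
    X = np.asarray(X, dtype=float)
    R = np.cov(X, rowvar=False)            # n x n covariance matrix R_x
    eigvals, eigvecs = np.linalg.eigh(R)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]  # indices of the m largest eigenvalues
    return eigvecs[:, order].T

def project(X, W):
    """Project all patterns at once: Y = X W^T (one reduced pattern per row)."""
    return np.asarray(X, dtype=float) @ W.T

# Usage on synthetic data (50 patterns, 5 features, reduced to 2 components).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) * np.array([5.0, 3.0, 1.0, 1.0, 1.0])
W = klt_matrix(X, 2)
Y = project(X, W)
```

`numpy.linalg.eigh` is used because the covariance matrix is symmetric; it returns orthonormal eigenvectors, so the rows of `W` satisfy $WW^T = I$.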



4 Rough Sets 

The rough sets theory has been developed by Professor Pawlak [1] for knowledge
discovery in databases and experimental data sets. Let us consider an information
system given in the form of the decision table

$$DT = \langle U, C \cup D, V, f \rangle \tag{19}$$

where $U$ is the universe, a finite set of $N$ objects $\{x_1, x_2, \ldots, x_N\}$, $Q = C \cup D$ is
a finite set of attributes, $C$ is a set of condition attributes, $D$ is a set of decision
attributes, $V = \bigcup_{q \in C \cup D} V_q$, where $V_q$ is the domain (set of values) of attribute
$q \in Q$, and $f : U \times (C \cup D) \to V$ is a total decision function (information function,
decision rule in DT) such that $f(x, q) \in V_q$ for every $q \in Q$ and $x \in U$.

For a given subset of attributes $A \subseteq Q$ the relation $IND(A)$ (denoted by $\tilde{A}$)

$$IND(A) = \{(x, y) \in U \times U : \text{for all } a \in A,\ f(x, a) = f(y, a)\} \tag{20}$$

is an equivalence relation on the universe $U$ (called an indiscernibility relation).

For a given information system $S$ a given subset of attributes $A \subseteq Q$ determines
the approximation space $AS = (U, IND(A))$ in $S$. For a given $A \subseteq Q$ and $X \subseteq U$
(a concept $X$), the $A$-lower approximation $\underline{A}X$ of set $X$ in $AS$ and the $A$-upper
approximation $\overline{A}X$ of set $X$ in $AS$ are defined as follows:

$$\underline{A}X = \{x \in U : [x]_A \subseteq X\} = \bigcup \{Y \in A^* : Y \subseteq X\} \tag{21}$$

$$\overline{A}X = \{x \in U : [x]_A \cap X \ne \emptyset\} = \bigcup \{Y \in A^* : Y \cap X \ne \emptyset\} \tag{22}$$

A reduct is the essential part of an information system (related to a subset of
attributes) which can discern all objects discernible by the original information
system. A core is a common part of all reducts. Given an information system
$S$ with condition and decision attributes $Q = C \cup D$, for a given set of condition
attributes $A \subseteq C$ we can define a positive region $POS_A(D)$ in the relation
$IND(D)$ as

$$POS_A(D) = \bigcup \{\underline{A}X \mid X \in IND(D)\} \tag{23}$$

The positive region $POS_A(D)$ contains all objects in $U$ which can be classified
without error (ideally) into distinct classes defined by $IND(D)$ based only on
information in the relation $IND(A)$.

For an information system $S$ and a subset of attributes $A \subseteq Q$ an attribute
$a \in A$ is called dispensable in the set $A$ if $IND(A) = IND(A - \{a\})$ (it means
that indiscernibility relations generated by sets $A$ and $A - \{a\}$ are identical).
The set of all indispensable attributes in the set $A \subseteq Q$ is called a core of $A$ in
$S$, and it is denoted by $CORE(A)$.
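
The definitions (20)-(23) and the core can be illustrated on a toy decision table. The sketch below is an assumption-laden illustration, not the authors' implementation: the table is stored as a dictionary from object name to attribute values, and the function names and the four-object example are invented for this purpose.

```python
def partition(table, attrs):
    """Equivalence classes of IND(attrs): objects with equal values on all attrs."""
    classes = {}
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())

def lower_approx(table, attrs, X):
    """Union of the classes fully contained in the concept X (Eq. 21)."""
    return set().union(*[Y for Y in partition(table, attrs) if Y <= X])

def upper_approx(table, attrs, X):
    """Union of the classes that intersect the concept X (Eq. 22)."""
    return set().union(*[Y for Y in partition(table, attrs) if Y & X])

def positive_region(table, cond, dec):
    """Union of lower approximations of all decision classes (Eq. 23)."""
    pos = set()
    for X in partition(table, dec):
        pos |= lower_approx(table, cond, X)
    return pos

def core(table, attrs):
    """Attributes whose removal changes the indiscernibility relation."""
    def classes(subset):
        return {frozenset(Y) for Y in partition(table, subset)}
    full = classes(attrs)
    return {a for a in attrs if classes([b for b in attrs if b != a]) != full}

# Toy table: objects 3 and 4 are indiscernible on (a, b) but differ on decision d.
table = {
    1: {"a": 0, "b": 0, "d": 0},
    2: {"a": 0, "b": 1, "d": 1},
    3: {"a": 1, "b": 0, "d": 1},
    4: {"a": 1, "b": 0, "d": 0},
}
concept = {2, 3}                                     # objects with d = 1
low = lower_approx(table, ["a", "b"], concept)       # {2}
up = upper_approx(table, ["a", "b"], concept)        # {2, 3, 4}
pos = positive_region(table, ["a", "b"], ["d"])      # {1, 2}
```

Objects 3 and 4 fall in the boundary region (upper minus lower approximation), which is why they are excluded from the positive region.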

We have applied the idea of a rough sets reduct to the proposed technique of feature
selection/reduction of face images and the corresponding data set reduction.



5 Rough sets for feature reduction/selection 



PCA does not guarantee that the selected first principal components, as a
feature vector, will be adequate for classification. One possibility for selecting
features from principal components is to apply rough sets theory [1], [2].
Specifically, the computation of a reduct, as defined in rough sets, can be used to select
those principal components that constitute a reduct; these principal components will
describe all concepts in a data set.

The rough sets method is used for selection of a reduct from the discretized
reduced PCA patterns. The final pattern is formed from real-valued reduced PCA
patterns based on the selected reduct. We have applied the following algorithm
for feature selection:

Algorithm: Feature extraction/selection using PCA and rough sets

Given: An $N$-case data set $T$ containing $n$-dimensional ($n = n_{Haar,r,pca,r}$)
patterns $x = x_{Haar,r,pca,r}$ with real-valued attributes, labeled by $l$ associated
classes $\{(x^1, c^1_{target}), (x^2, c^2_{target}), \ldots, (x^N, c^N_{target})\}$.

1. Isolate from the original class labeled data set $T$ a pattern part as an $N \times n$
data pattern matrix $X$.

2. Compute the covariance matrix $R_x$ for $X$.

3. Compute for the matrix $R_x$ the eigenvalues and corresponding eigenvectors,
and arrange them in descending order.

4. Select the reduced dimension $m \le n$ of a feature vector in principal
components space using a defined selection method.

5. Compute the optimal $m \times n$ transform matrix $W_{KLT}$.

6. Transform original patterns from $X$ into $m$-dimensional feature vectors
in the principal component space by the formula $Y = X W_{KLT}^T$ for the whole
set of patterns.

7. Discretize the patterns in $Y$ with resulting matrix $Y_d$.

8. Compose the decision table $DT_m$ constituted with the patterns from the
matrix $Y_d$ with the corresponding classes from the original data set $T$.

9. Compute a selected reduct from the decision table $DT_m$, treated as a selected
set of features $X_{feature,reduct}$ describing all concepts in $DT_m$.

10. Compose the final (reduced) real-valued attribute decision table $DT_{f,r}$
containing those columns from the projected discrete matrix $Y_d$ which
correspond to the selected feature set $X_{feature,reduct}$. Label patterns by
corresponding classes from the original data set $T$.

The results of the discussed method of feature extraction/selection depend on the data
set type and three designer decisions: a) selection of dimension $m \le n_{Haar,r}$ in
the principal component space, b) the discretization method applied, and c) selection
of a reduct.
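
The ten steps of the algorithm can be sketched end to end as follows. Everything beyond the step order is an assumption made for illustration: equal-width binning stands in for the unspecified discretization method, an exhaustive smallest-subset search stands in for the unspecified reduct computation (feasible only for small $m$), the final table takes the real-valued columns of $Y$ as stated in the prose above, and all function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def pca_project(X, m):
    """Steps 2-6: covariance, eigenvectors in descending order, Y = X W^T."""
    X = np.asarray(X, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    W = eigvecs[:, np.argsort(eigvals)[::-1][:m]].T
    return X @ W.T

def discretize(Y, bins=3):
    """Step 7 (assumed method): equal-width binning of every column."""
    Yd = np.empty(Y.shape, dtype=int)
    for j in range(Y.shape[1]):
        edges = np.linspace(Y[:, j].min(), Y[:, j].max(), bins + 1)[1:-1]
        Yd[:, j] = np.digitize(Y[:, j], edges)
    return Yd

def positive_region_size(Yd, labels, cols):
    """Number of objects whose indiscernibility class carries a single label."""
    seen = {}
    for row, lab in zip(Yd[:, list(cols)], labels):
        seen.setdefault(tuple(row), []).append(lab)
    return sum(len(v) for v in seen.values() if len(set(v)) == 1)

def smallest_reduct(Yd, labels):
    """Step 9 (assumed method): smallest column subset that preserves the
    positive region of the full table; exhaustive, so only for small m."""
    m = Yd.shape[1]
    target = positive_region_size(Yd, labels, range(m))
    for size in range(1, m + 1):
        for cols in combinations(range(m), size):
            if positive_region_size(Yd, labels, cols) == target:
                return list(cols)
    return list(range(m))

def select_features(X, labels, m, bins=3):
    """Steps 1-10: returns the real-valued reduced patterns and the reduct."""
    Y = pca_project(X, m)
    reduct = smallest_reduct(discretize(Y, bins), labels)
    return Y[:, reduct], reduct

# Usage on synthetic data: 20 patterns with 4 features, two classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
labels = (X[:, 0] > 0).astype(int)
features, reduct = select_features(X, labels, m=3)
```

For a table as large as the one in the experiments (200 PCA components), the exhaustive search would have to be replaced by a heuristic or dedicated rough sets software.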



6 Numerical experiments - face recognition 

As a demonstration of the role of rough sets methods for feature selection/reduction
we have carried out numerical experiments on recognition of 13 classes of facial
images, with 30 different instances for each class. Each gray scale face image was
of the dimension 112 x 92 pixels. Given the original face image set, the first level
Haar wavelets transform was applied to centered 64 x 64 pixel sub-windows of an
original face image. The resulting 64 x 64 Haar wavelets coefficient matrix has
been used to form an original 64 x 64 = 4096 element Haar wavelets feature pattern. The
Haar wavelets patterns have been heuristically reduced to the size 2048. Then,
according to the proposed method, we have applied Principal Component Analysis
(PCA) for feature projection/reduction, followed by the heuristic reduction
of projected PCA patterns to the length of 200. In the final processing step
we have applied the rough sets method for the final feature selection/reduction
based on reduced PCA patterns. The discretized training set was used to find
the minimal 6-element reduct [1]. This reduct was used to form the final pattern.
The entire image data set was divided into training, validation, and test sets: 70% of these
sub-images were used for the training set, 15% for the validation set, and the
final 15% for the test set.

The training, validation, and test sets (decision tables) with real-valued pattern
attributes were reduced according to the selected reduct. Classification of
face images has been performed by a single-hidden-layer, 80-neuron error
backpropagation neural network, with the resulting accuracy of 99.24% for the
test set.

7 Conclusion 

The sequence of data mining steps, including application of 2D Haar wavelets
for feature extraction, and PCA and rough sets for projection and feature selection,
has shown potential for designing neural network classifiers for face
images. Rough sets methods have shown the ability to significantly reduce pattern
dimensionality.
Abstract. This new method of detecting spots is based on the general concept
of the rough sets. The lower approximation collects the set of objects which are
assigned to a class in question without any doubt, and the upper approximation
is composed of the lower approximation and the objects which are classified to
the class with some uncertainty. In the initial step all objects detected are
assigned to the class of spot-like objects (i.e. spot candidates). Subsequent
refinements tend to extract spots with higher and higher degree of certainty,
based on the lower approximation. A learning system is defined based on the
rough sets, making the learning phase automatic (by exploitation of the lower
approximation) and refining the set of candidates.



1 Introduction 

Space operations require one spacecraft to be mated with another spacecraft in orbit.
Fig. 1 shows two spacecraft, one near another. Several disks on the spacecraft are
used to define their relative locations (see Fig. 1). Centers of the disks were computed
by an on-board computer and then used for the mating operation. Difficulties are with
clutter (incl. objects resembling spots), reflections, obscurations and shadows. The
first step in the processing is usually to detect the spot and then to work on the small
image tile containing the spot (e.g., of the dimensions of 32 by 32 pixels) to precisely
compute the spot location. Fig. 2 is an example of a 32 by 32 tile with a spot. You
can see a real shadow and clutter on it. The computer processing can admit several
spot-like objects detected in addition to real spots, which can be further rejected by
using previous knowledge about the relative distances between the real spots and
knowledge about expected approximate relative locations of the two spacecraft at
any time. The task is to reduce the number of the extra spot-like objects detected.
This paper presents Wojcik's methods to solve this task.

The possibility of computing the position of a disk with resolution much better
than one pixel spacing was discovered by Z. Wojcik and published in 1976 [2]. The
resolution achieved by Wojcik was 0.05 pixel spacing. The significance of using the
disk-shaped spot is an increased resolution of the camera image by about 20 times in
the horizontal direction and 20 times in the vertical direction.



¹ The work was under NASA contract No. NAS9-19100

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 569-576, 2001. 
© Springer-Verlag Berlin Heidelberg 2001 
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2 Rough Sets Model for Detection of Spots in Space Image 



Our rough sets model [1] incorporates the Universe (or data) and an equivalence relation R.
The equivalence class describes the class of the objects looked for.




Fig. 1. The input image with two spacecraft. Black spots are used to define accurate positions
of the craft with respect to each other when mating in space



Connected edges of regions detected by applying subsequent thresholds to a gray-
level image are used. Figs. 3-4 show examples of such edges of a gray-level image
(Fig. 2) with a spot. The brightest pixels in Figs. 3-4 represent such edges. The
regions were detected by using several subsequent thresholds. The set of geometrical
centers $x_e, y_e$ of each connected edge $e$ of each region detected at any threshold is used
as the Universe for the rough sets analysis. The initial equivalence relation $R_d$ is the
operation collecting the set of geometrical centers $x, y$ located within the distance $d$
from the center $x_e, y_e$. The equivalence class $[x_e, y_e]/R_d$ extracted by $R_d$ is the set of
geometrical centers within the distance $d$ from $x_e, y_e$:

$$[x_e, y_e]/R_d = \{(x, y) \in U : |x - x_e| < d \wedge |y - y_e| < d\} \tag{1}$$

That is, the centroids of the connected edge features lie close together. For
instance, the equivalence class of a spot-like object is made by each geometrical
center of the connected circular edges located about the image center in Figs. 3-4,
because they are within the distance d=3 from the connected circular edge shown in
Fig. 4.

The cardinal number (count of the elements) of the equivalence class must exceed
an assumed threshold cThr (e.g. cThr=3):

$$Card([x_e, y_e]/R_d) > cThr \tag{2}$$
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Fig.2. The input tile (size of 32 by 32 pixels) 



That is, if Eq. (2) is not satisfied, then we do not consider the candidate equivalence
class (Eq. (1)) of connected edges to be a valid spot candidate (compare also Figs. 3-4).
The upper approximation US of the set of spots S in an input image is given by the set
of the geometrical centers satisfying Eq. (2):

$$US = \{[x_e, y_e] \in U : Card([x_e, y_e]/R_d) > cThr\} \tag{3}$$

The final equivalence relation R detecting the true spots (i.e., the spots and nothing
but the spots) is unknown. We define the lower approximation LS of the true spots S
in an input image detected with certainty:

$$LS = \{[x_e, y_e] \in U : [x_e, y_e]/R \subseteq S\} \tag{4}$$

The lower approximation is defined for some threshold lThr higher than cThr:

$$LS = \{[x_e, y_e] \in U : Card([x_e, y_e]/R_d) > lThr\} \tag{5}$$

The set of true spots, which is larger than LS and smaller than US, is approached by
eliminating from US all objects which are not the spots with the aid of additional
features F (relations on the US). We hope to define the relation $R_d$ and the features
F so that:

$$R = R_d \cap F \tag{6}$$
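
Equations (1)-(5) can be sketched as follows. This is an illustrative reading, not the on-board implementation: the centers are given as plain (x, y) tuples, the function names are hypothetical, and the thresholds in the usage example are arbitrary values chosen for the toy data.

```python
def equivalence_class(centers, e, d):
    """Indices of all edge centers within distance d (per coordinate) of center e (Eq. 1)."""
    xe, ye = centers[e]
    return [i for i, (x, y) in enumerate(centers)
            if abs(x - xe) < d and abs(y - ye) < d]

def approximations(centers, d, c_thr, l_thr):
    """Upper approximation: cardinality > c_thr (Eq. 3).
    Lower approximation: cardinality > l_thr, with l_thr > c_thr (Eq. 5)."""
    upper, lower = [], []
    for e in range(len(centers)):
        card = len(equivalence_class(centers, e, d))
        if card > c_thr:
            upper.append(e)
        if card > l_thr:
            lower.append(e)
    return upper, lower

# Toy data: five nearly concentric edge centers near (16, 16), a pair of
# neighboring clutter centers near (3, 3), and one isolated center.
centers = [(16, 16), (16.5, 15.5), (17, 16), (15.5, 16.5), (16, 17),
           (3, 3), (4, 28), (3.5, 3.2)]
upper, lower = approximations(centers, d=3, c_thr=1, l_thr=3)
```

In this toy case the clutter pair enters the upper approximation but not the lower one, which is the situation the additional features F are meant to resolve.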

Each spot-like object detected by $R_d$ and subjected to detection of any feature
$f \in F$ may not be eliminated from the set of spots, and still must make a part of the
US:

$$US_f = \{[x_e, y_e] \in U : [x_e, y_e]/(R_d \cap f) \cap S \ne \emptyset\} \subseteq US \tag{7.a}$$

$$\bigcup_{f \in F} US_f = US \tag{7.b}$$




Fig. 3. Input image tile (Fig. 2) subjected to a threshold. All non-black pixels are below the
threshold selected. Then the regions below the threshold were shrunk by one pixel, and finally
the edges of the shrunken objects were detected. One connected edge followed (gray level
turned darker)



where $\emptyset$ is the empty set. Equations (7.a,b) say that the set of edges of geometrical
centers near $x_e, y_e$ extracted by $R_d$ and additionally refined by a proper relation (feature)
f must still be a member of the class of spot-like objects US for as long as this set
of edges represents a true spot. Each feature f refines the equivalence class (Eq. (1))
by simply adding an additional constraint f to it in the form of the term $[x_e, y_e]/(R_d \cap f)$.
This constraint, however, may not eliminate the candidate from the set of spots if the
candidate is the true spot, which is represented by the condition
$[x_e, y_e]/(R_d \cap f) \cap S \ne \emptyset$. The set of the candidates $[x_e, y_e]/(R_d \cap f) \cap S \ne \emptyset$ refined
by the feature f is then still a subset of the upper approximation US of the true spots S.

We define features F so that the application of all of them makes the true spots S:

$$\bigcap_{f \in F} US_f = S \supseteq LS \tag{8}$$



An example of the feature f is the ratio of the gray level at the center of the spot
candidate to the gray level around the spot candidate (below an assumed threshold to
remain a member of US). An example of another feature f is the number of zero-
crossings (transitions of the gradient sign from positive to negative) around the spot
candidate (below an assumed threshold to remain a member of US).

Although it is not demonstrated here, the result US of this spot detection can be 
further fed into a “constellation sieve” or matching with a knowledge database of 
anticipated spot constellations for the final refinement. That process associates a 
detected spot with its physical counterpart on a station structure. 



3 Fast Rough Sets Learning System 

A very effective rough sets system is one which: 1. finds the true spots (i.e. the spot-
like objects with the highest number of concentric edges), then 2. learns about their
features and measures of the features, and finally 3. rejects from the set of candidates
all the spot-like objects whose parameters are too far from the parameters of the true
spots. This learning is from examples of true spots provided by the lower
approximation given by Eq. (5). For instance, given the upper approximation (i.e., all
the spot candidates), the average diameter Dav of the spots (i.e., of objects $LS_1$ in the
upper range of the number of concentric edges) is measured in the first step of the
refinement:

$$LS_1 = [x_e, y_e]/R_1 = \{[x_e, y_e] \in US : MaxC - Card([x_e, y_e]/R_d) < cT\} \tag{8}$$

where MaxC is the maximum number of concentric edges of the spot-like objects US,
$Card([x_e, y_e]/R_d)$ is the number of concentric edges of the spot-like object defined by
the equivalence class $[x_e, y_e]/R_d$, cT is the range in the number of concentric edges
around MaxC making the lower approximation, and $LS_1$ is assumed to be the
essential part of the true spots because of the highest number of concentric edges in
each object.



After the first stage of learning about Dav in the set $LS_1$, all those spot-like
objects are rejected from US whose diameter D is different from Dav by more than an
assumed threshold dT:

$$US_1 = \{[x_e, y_e] \in US : [x_e, y_e]/(R_d \cap f{=}|D - Dav| < dT) \cap S \ne \emptyset\} \subseteq US \tag{9}$$

The set $US_1$ approaching the true spots no longer has objects whose diameter is
different from that of the true spots.
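
The first learning stage can be sketched as below. The representation of a candidate as a (number of concentric edges, diameter) pair, the function name `learn_and_refine` and the numeric values are assumptions for illustration; the thresholds `c_t` and `d_t` play the roles of cT and dT in the text.

```python
def learn_and_refine(candidates, c_t, d_t):
    """candidates: list of (cardinality, diameter) pairs for the upper approximation.
    Learns the average diameter Dav from the candidates whose cardinality is
    within c_t of the maximum, then keeps candidates with |D - Dav| < d_t."""
    max_c = max(card for card, _ in candidates)
    examples = [diam for card, diam in candidates if max_c - card < c_t]
    d_av = sum(examples) / len(examples)
    kept = [i for i, (_, diam) in enumerate(candidates)
            if abs(diam - d_av) < d_t]
    return d_av, kept

# Toy upper approximation: two strongly concentric spots, one weak spot with a
# matching diameter, and two clutter objects with very different diameters.
candidates = [(6, 10.0), (5, 11.0), (2, 25.0), (3, 10.5), (2, 4.0)]
d_av, kept = learn_and_refine(candidates, c_t=2, d_t=2.0)
```

Here the weak candidate with diameter 10.5 survives the refinement even though it was not among the learning examples, while both clutter objects are rejected.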

The advantage of using rough sets for the learning system is making the learning
phase automatic (by exploitation of the lower approximation). The lower
approximation of the true spots automatically provides the best examples from which
the system can learn the parameters of the true elements of the class. Lower and
upper bounds of the parameters are then used to refine the upper approximation to
detect the true class.
Fig. 4. Input image tile (Fig. 2) subjected to a lower threshold. Regions below the threshold
were shrunk by one pixel, and edges of the shrunken objects are represented by the brightest
pixels. One connected edge followed (gray level turned darker)



4 Results 

Selected steps in processing a gray-level image of a spot with a bright reflection are
shown in Figs. 2-4. The methodology uses multiple thresholds [3], based on which
all binary objects obtained at all thresholds are processed to detect the equivalence
class of edges. The image shown in Fig. 3 was obtained by applying the first simple
thresholding, which produces the first binary representation, then shrinking all the objects
below the threshold by one pixel, and then detecting the edges of the shrunken objects.
Shrinking leaves only those object pixels which do not have any
background pixel in the direct neighborhood. Edge detection leaves only
those object pixels which do have a background pixel in the direct
neighborhood [3]. This simple edge detector satisfies the underlying requirement for
edge connectivity exploited in further processing [3].
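
The shrinking and edge-detection rules just described can be sketched with NumPy. Two points are assumptions of this sketch and are not stated in the text: the "direct neighborhood" is taken to be the 8-neighborhood, and pixels outside the image are treated as background. The function names are hypothetical.

```python
import numpy as np

def has_background_neighbor(mask):
    """True where at least one of the 8 neighbors is background.
    Pixels outside the image are treated as background (assumption)."""
    padded = np.pad(mask, 1, constant_values=False)
    h, w = mask.shape
    result = np.zeros_like(mask, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            result |= ~padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return result

def shrink(mask):
    """Keep only object pixels with no background pixel in the neighborhood."""
    mask = np.asarray(mask, dtype=bool)
    return mask & ~has_background_neighbor(mask)

def edges(mask):
    """Keep only object pixels that do have a background neighbor."""
    mask = np.asarray(mask, dtype=bool)
    return mask & has_background_neighbor(mask)

# Usage: threshold a toy tile, shrink the dark object by one pixel, detect its edge.
tile = np.full((7, 7), 200)
tile[1:6, 1:6] = 50                 # a dark 5 x 5 object on a bright background
binary = tile < 100                 # objects are the pixels below the threshold
shrunk = shrink(binary)             # 3 x 3 object remains
edge = edges(shrunk)                # 8-pixel ring around the center pixel
```

The geometrical center of each connected edge produced this way would then feed the equivalence-class computation of Section 2.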

The image presented in Fig. 3 was obtained by tracing one connected edge [3].
Since our edges are always connected, we segment the binary image into objects by
the edge following algorithm [3]. The algorithm turns each traced edge pixel to a different
color and moves to a neighboring edge pixel not traced yet [3]. It completes when there are
no more untraced edge pixels in the direct neighborhood. Note that shrinking
helps to separate binary objects from each other.

The image in Fig. 4 was obtained by using a threshold lower than the threshold applied to
the image in Fig. 3. Again, shrinking helps to separate binary objects from each other
(compare three connected edge components represented by the brightest pixels on two
binary objects).

Each connected edge component was traced separately. Fig. 4 shows one
connected edge traced and two more left for edge tracing. Note that objects for
smaller thresholds become smaller, and if the object is circular, the geometrical center
of its connected edge stays in a relatively stable location in the image plane. The
connected edge of the large shrunken object shown in Fig. 3, for example, was split into
two smaller objects by the subsequent smaller threshold (compare Fig. 4), so its
geometrical center happened to be unstable. This stability of circular connected edges
is exploited in our algorithm.

As shown in Figs. 2-4, subsequent lower thresholds, shrinking, edge detection and
edge following result in connected edges of the circular object (spot) at a relatively
stable location of its geometrical center, and connected edges of the other objects with
much less stable geometrical centers.

Fig. 1 presents the input space image with spots located on the two spacecraft, and
Fig. 5 shows the lower approximations of the spots, all marked with a white cross.
The spot at the image bottom (Fig. 5) was not processed because it was too close to the
image frame.




Fig.5. Image with two spacecraft. The true set of the spots detected by a computer program is 
marked with white crosses 
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5 Conclusion 

The equivalence class is a powerful tool capable of detecting a class or an object from
data. The equivalence class of concentric connected edges defines a spot candidate in
our application. Wojcik's method applies the concentricity (the equivalence relation)
and uses additional features refining the true class (spots) from all candidates (e.g.,
from the upper approximation). A high level of concentricity (high value of the
cardinal number of the equivalence class) defines true spots (class), but not the
whole set of the true class. A high level of concentricity (high range of the cardinal number of
the equivalence class) defines the lower approximation of the true spots (class).
Wojcik's learning system defines the examples as the lower approximation, then
learns parameters and features from the examples, computes their ranges, and based
on them refines the candidates to achieve the true class of objects looked for. The
rough sets approach provides a model both for the detection of the spots (class) and for the
learning system (automatic learning of the true examples and then learning from the
examples).

Further acknowledgement: The author expresses thanks to Dr. Richard Juday
(NASA at Houston) for valuable comments on this article and for the supervision of
the project, and to Dr. Mike Rollins for the collaboration.
Abstract. This new method of detecting and compensating shadow is based
on the general principle of the rough sets. Shadow recognition is constrained
by the rough sets principle according to which the upper approximation of
objects must contain a non-empty lower approximation - the true class of objects
in question. By imposing this constraint, the well known localization-detection
trade-off is solved. In the first step the shadow is detected reliably by using a
high threshold. Reliable classification (shadow detection) with the aid of a high
threshold makes the lower approximation of shadow. Then, the upper
approximation is constructed based on the lower approximation (reliably
detected shadow) by using a low threshold. Directly using a low threshold
would detect a lot of clutter and noise rather than shadow. The rough sets principle
prevents this: each shadow candidate must contain the lower approximation.
On the other hand, making the threshold high detects shadows reliably, but not
accurately; for instance, shadow frequently begins at a low threshold. Rough
sets solves this problem by tracking the upper approximation with the aid of a
low threshold from the lower approximation.



1 Introduction 

Spacecraft position during the mating operation on an orbit in space is determined 
based on centers of the disks attached to the craft. Shadow passing the disk image 
affects severely the results of computation of the disk center. Fig. 1.a presents a 32 by 
32 pixel image tile with a disk and a shadow. The task is to neutralize the shadow 
presence so that the results of the spacecraft positioning are not affected by the 
shadow. The required accuracy is 0.05 pixel spacing, so even a small shadow 
gradient has an impact on the results. The discovery made by Z. Wojcik [4] indicates 
that the center of a disk image taken by a camera can be determined with the accuracy 
of 0.05 pixel spacing when shadow is not present. 

For a shadow to be compensated it must be recognized first. But shadow is 
difficult to recognize. Contextual knowledge about the scene objects is needed. 
Image factorization can be used for shadow removal [2, 3]: shadow can be removed 
by appropriately weighting a factor associated with the shadow gray level. However, 
somebody must tell which factor is the shadow. In addition, a shadow edge is not 
represented by a single factor, and weighting a few factors may cause other image 
details to disappear. 

W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 577-583, 2001. 
© Springer-Verlag Berlin Heidelberg 2001 

This research is in a much more favorable situation, because the spot context is very 
well defined. The spot is located on a uniform background; therefore, two significant 
gray level changes on two sides of the spot indicate the presence of a shadow. The two 
gray level changes on the two sides of the spot make an equivalence class. A shadow is 
detected if there is an equivalence class composed of two significant gray level 
changes on two sides of the spot. Other shadow pixels lying along the assumed shadow 
edge are attached, with some ambiguity, to the equivalence class of the detected 
shadow, making a shadow approximation. The ambiguity comes from using a new 
threshold lower than the threshold used for shadow detection. The shadow upper 
approximation (detected at the lower threshold) is then compensated for based on the 
gray level changes of the equivalence class members. 




2 Rough Sets Model for Detection of Shadow 
around a Spot in Space Image 

Our rough sets model [1] incorporates the Universe (or data) U and an equivalence 
relation R. The equivalence relation describes the class of the objects looked for, in 
our case, the two gray level changes indicating the presence of a shadow. 

The gray level changes are collected along the four sampling lines l, r, u, b around 
the spot: on the left, right, upper and bottom sides of the spot, respectively. These four 
sampling lines around each spot make the Universe U. More sampling lines are 
allowed but are not used in this research. The four edges of the square inside the 
image shown in Fig. 1.b are examples of the four sampling lines. The maximum 
changes ml, mr, mu, mb and the minimum changes (negative values) nl, nr, nu, nb 
in the sampling lines l, r, u, b are computed. 

The equivalence class $[ml, mr]/R_h$ or $[nl, nr]/R_h$ extracted by the equivalence 
relation $R_h$ is the set collecting the two gray level changes, positive ml and mr or 
negative nl and nr, in the two sampling lines l and r, exceeding a threshold T: 

$$[ml, mr]/R_h = \{\, l, r \in U : ml > T \wedge mr > T \,\} \qquad (1.a)$$

$$[nl, nr]/R_h = \{\, l, r \in U : nl < -T \wedge nr < -T \,\} \qquad (1.b)$$

$R_h$ specified above detects horizontal shadows passing spots: with the darker side on 
the left side of the spot image, and with the darker side on the right side of the spot 
image, respectively. The equivalence class $[mu, mb]/R_v$ or $[nu, nb]/R_v$ extracted by 
$R_v$ is the set of two gray level changes in the two sampling lines u and b exceeding a 
threshold T: 

$$[mu, mb]/R_v = \{\, u, b \in U : mu > T \wedge mb > T \,\} \qquad (1.c)$$

$$[nu, nb]/R_v = \{\, u, b \in U : nu < -T \wedge nb < -T \,\} \qquad (1.d)$$




Fig. 1.b. The input image (Fig. 1.a) after shadow compensation in the spot area 

$R_v$ detects vertical shadows passing spots. The threshold T for the shadow gray 
level changes is relatively high so that there is full certainty that the shadow is 
detected if it exists. The certainty of the classification to the class of shadows S 
satisfies the definition of the lower approximation of the rough sets. Thus, the lower 
approximation LS of a shadow S is: 

$$LS = \{\, l, r, u, b \in U : [ml, mr]/R_h \subseteq S \;\cup\; [nl, nr]/R_h \subseteq S \;\cup\; [mu, mb]/R_v \subseteq S \;\cup\; [nu, nb]/R_v \subseteq S \,\} \qquad (2)$$
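The high-threshold detection of Eqs. (1.a)-(1.d) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary layout of the sampling lines and the use of first differences as "gray level changes" are hypothetical choices, not taken from the paper.

```python
import numpy as np

def extreme_changes(profile):
    """Return (largest positive, most negative) gray-level change along one sampling line.

    Assumption: a "gray level change" is the first difference of the profile.
    """
    diffs = np.diff(np.asarray(profile, dtype=float))
    return diffs.max(), diffs.min()

def lower_approximation(lines, T):
    """lines: dict with keys 'l', 'r', 'u', 'b' holding 1-D gray-level profiles.

    Returns the set of sampling-line pairs whose changes both exceed the
    high threshold T, i.e. the pairs that reliably indicate a shadow.
    """
    m = {k: extreme_changes(v)[0] for k, v in lines.items()}
    n = {k: extreme_changes(v)[1] for k, v in lines.items()}
    detected = set()
    for a, b in (("l", "r"), ("u", "b")):
        if m[a] > T and m[b] > T:        # Eqs. (1.a) and (1.c)
            detected.add((a, b, "+"))
        if n[a] < -T and n[b] < -T:      # Eqs. (1.b) and (1.d)
            detected.add((a, b, "-"))
    return detected

# Hypothetical profiles: a shadow edge crosses the left and right lines only.
lines = {
    "l": [100, 100, 101, 160, 161],
    "r": [100, 99, 100, 158, 160],
    "u": [100, 101, 100, 102, 101],
    "b": [160, 161, 159, 160, 161],
}
print(lower_approximation(lines, T=40))
```

With the same data and a very low threshold (for example 0.5), the noise on the u and b lines also passes the test, which is the clutter problem the high threshold is meant to avoid.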
Once the shadow is detected reliably, by looking for high gray level changes in 
places where they are not supposed to exist under regular lighting without shadow, 
the upper approximation of the shadow is traced from the location of its detection by 
using a new threshold t lower than T. Equivalence relations $r_h$ and $r_v$ now include 
thresholding with the aid of a small threshold t and the connectivity feature c with the 
lower approximation of shadow. The equivalence classes $[ml, mr]/r_h$ or $[nl, nr]/r_h$ 
of small intensity shadow are the same as Eqs. (1.a,b,c,d) with the exception that t is 
used instead of T. The upper approximation US of shadow finds accurately the 
beginnings and ends of shadows because it works at a low threshold t: 

$$US = \{\, l, r, u, b \in U : [ml, mr]/(r_h \cap c_h) \cap S \neq \emptyset \;\cup\; [nl, nr]/(r_h \cap c_h) \cap S \neq \emptyset \;\cup\; [mu, mb]/(r_v \cap c_v) \cap S \neq \emptyset \;\cup\; [nu, nb]/(r_v \cap c_v) \cap S \neq \emptyset \,\} \qquad (3)$$

where $c_h$ is the operation testing connectivity with the corresponding equivalence 
classes $[ml, mr]/R_h$ or $[nl, nr]/R_h$ detecting shadow at a high threshold, and $c_v$ is the 
operation testing connectivity with the corresponding high threshold equivalence 
classes $[mu, mb]/R_v$ or $[nu, nb]/R_v$. 




Fig.2.a. Another input image with a real shadow passing the spot 

Eq. (3) says that shadow detection with the aid of relations $r_h$ and $r_v$, thresholding 
gray level changes by using a low threshold t, must test the corresponding connectivity 
$c_h$ or $c_v$ with shadow detected at a high threshold in order to still represent shadows 
(i.e. to make a non-empty set of shadows S). The way of implementing Eq. (3) is to 
detect shadows reliably at a high threshold first and then to trace each high shadow 
gray level change to its ends by using a low threshold. 
Fig. 2.b. The input image from Fig. 2.a after compensating shadow in the spot area 



3 Overcoming the Localization-Detection Trade-Off 
by Using the Rough Sets Principle 

Because of the localization-detection trade-off, direct thresholding either detects 
reliably but with low precision (at a high threshold), or localizes with high precision 
but does not detect reliably, because the results also contain a lot of clutter and noise 
(at a low threshold). As shown in Section 2, the rough sets model splits the shadow 
recognition task into two phases. Phase one uses the lower approximation, collecting 
shadows detected reliably by using a high threshold T. The detection of shadows on a 
uniform background is then certain. However, the localization of the beginnings and 
ends of the shadows is not accurate for a high threshold T. If the threshold is lowered 
to detect the shadow more accurately at the level of the lower approximation, noise 
and clutter are detected even if no shadow exists. Therefore, keeping in mind that the 
lower approximation detects the shadows reliably, phase two uses the upper 
approximation. The upper approximation involves the lower approximation: starting 
from each lower approximation, the continuity of each reliably detected shadow is 
traced by using a lower threshold. When starting from a low threshold, it would not 
be clear whether a detected shadow candidate contains the lower approximation of 
shadow, i.e. the true shadow. Rough sets impose the constraint that the upper 
approximation can be constructed only when the lower approximation is not empty. 
4 Fast Rough Sets Machine Learning 

The lower approximation collects the set of true examples on which the system learns 
about the whole class. Thus, the lower approximation (Eq. (2)) firmly characterizes 
shadow detected between two respective sampling lines. The upper approximation, 
too, characterizes shadow located between two sampling lines by its threshold t < T. 
Thus, shadow is detected, measured and characterized with certainty at all locations 
between two sampling lines. 

Shadow is now compensated between the two respective sampling lines by using 
the knowledge learned from the two firm examples delivered by the lower 
approximation. These lower approximation examples carry information on the pixels 
at which the gray level change exceeds threshold T, which indicates the presence of a 
shadow, and the upper approximation provides all remaining pixels associated with 
the shadow detected. These upper approximation pixels represent shadow of a lower 
gray level change. 

Characteristics of the upper approximation pixels are computed at the end of the 
learning phase. These learned characteristics provide specific gray level changes at 
all pixels making the upper approximation of shadow, based on which ratios are 
computed of the gray levels of the shadow upper approximation pixels to the gray 
levels of the pixels adjacent to the upper approximation (without shadow). The upper 
approximation pixels representing shadow (implied between the two sampling lines) 
are compensated by multiplying the upper approximation pixels by the reciprocal of 
the corresponding ratios. 
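The ratio-based compensation can be sketched as follows. The sketch assumes that one ratio is learned at each of the two sampling lines and that the ratio varies linearly between them; the linear interpolation, the function names and the sample values are assumptions made for illustration only.

```python
import numpy as np

def learn_ratio(shadow_level, adjacent_level):
    """Ratio of a shadowed gray level to the adjacent unshadowed gray level."""
    return float(shadow_level) / float(adjacent_level)

def compensate_between(pixels, ratio_left, ratio_right):
    """Compensate shadowed pixels lying between two sampling lines.

    Assumption: the shadow ratio changes linearly from the left sampling
    line to the right one. Each pixel is multiplied by the reciprocal of
    its interpolated ratio.
    """
    pixels = np.asarray(pixels, dtype=float)
    ratios = np.linspace(ratio_left, ratio_right, num=len(pixels))
    return pixels * (1.0 / ratios)

# Hypothetical values: background 100 drops to 80 in shadow on both lines,
# and the disk (true level 200) appears as 160 under the shadow.
r_left = learn_ratio(80, 100)
r_right = learn_ratio(80, 100)
print(compensate_between([80, 160, 160, 80], r_left, r_right))
```

Because the ratios are learned on the uniform background, the same scaling restores both background and disk pixels without requiring any knowledge of the disk itself.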

Wojcik’s rough sets machine learning works by defining and collecting the lower 
approximation, then gathering (learning) characteristics and features of all elements 
(examples) of the lower approximation, and finally by applying these characteristics 
and features to the candidate elements to recognize or process the true class members. 
The candidate elements in this application are the pixels implied by the interpolation 
from the lower approximation of shadow detected at the two sampling lines. The 
characteristics learned are the ratios of the gray levels of the upper approximation 
pixels to the pixels adjacent to the upper approximation. The pixels processed 
(compensated) with the aid of the knowledge learned are the pixels implied by the 
interpolation as making the shadow. In this application the upper approximation 
does not define the candidates of the true class; rather, it provides more accurate 
characteristics of shadow, which overcomes the well-known localization-detection 
trade-off. 



5 Conclusion 

The complexity of the localization-detection trade-off has been reduced to computing 
the rough sets: finding the lower and upper approximations. The upper approximation 
cannot be found reliably by directly using a low threshold because it is not known 
whether the candidates detected contain the lower approximation. Therefore, Wojcik’s 
method constructs the full rough sets representing shadows starting from the non-empty 
lower approximation detected reliably at a high threshold. Then, the upper 
approximation is traced at a low threshold from the lower approximation. The rough 
sets represent shadows reliably and accurately. 

Wojcik’s machine learning assumes that there is a lower approximation 
providing the true examples from which to find the set of features representing the 
class in question. Then these features are measured, and the characteristics collected 
are used as patterns to match the candidates. The candidates are finally refined by 
these characteristics. Finding true examples and the whole learning process can be 
automatic if the lower approximation is definable. Automation of the learning 
process is the advantage of this machine learning. 
Abstract. A modular neural network classifier design is presented. The 
objective behind the design is to enhance the classification performance 
of conventional neural classifiers according to two criteria, namely, 
reducing the classification error, and allowing vague/boundary 
classification decisions. The proposed model uses an unsupervised network to 
decompose the classification task over a number of neural network 
modules. During learning, every module is trained using samples representing 
the other modules, and modules are trained in parallel. After the 
training phase, every module inhibits or enhances the responses of the other 
modules by “voting” for the existence of the input within their 
decision boundaries. If the result of the majority vote is a “tie”, then the 
sample is classified as a vague class (or boundary) between the (two or 
more) classes that have the tie. The proposed classifier is tested using 
a two-dimensional illustrative benchmark classification problem. Results 
show an enhancement in the classification performance according 
to the above two criteria. 



1 Introduction 

The definition of rough sets [9] made several contributions to the field of 
classification, pattern recognition and knowledge discovery [5]. Defining 
“vague classes” is one of those contributions. Often, in real life systems, 
there are patterns/objects/attributes that cannot be naturally classified 
as belonging to any specific category. This is not because of a deficiency 
in the classification system. Even a human expert would fail to classify 
such patterns. As an example, there were samples in NCR’s numerals 
benchmark problem, which we previously used in [1], that can never be definitely 
classified as a digit “7” versus a digit “1”, or a digit “8” versus a digit “9”, 
and so on. It is more realistic, and more accurate, for the classification 
system to identify these patterns as “boundary” or “vague” and have 
this as the final classification decision for such samples. 
The current generation of NN classifiers suffers from a major drawback: 
their inability to cope with an increase in the size and/or complexity of the 
classification task [2]. This is often referred to as the scalability problem. 
Large networks tend to introduce high internal interference because of 
the strong coupling among their hidden-layer weights [7]. 

This paper introduces a modular neural network structure that en- 
hances the classification performance. The model uses a divide-and-conquer 
approach for decomposing the classification task into a group of simpler 
sub-tasks. Each classification sub-task is handled by a simple, fast and 
efficient module. Then, sub-solutions are integrated via a multi-module 
decision-making strategy, which has the ability to classify a tested sample 
as a “vague class”, or boundary, between two or more classes. 

Section 2 describes the proposed model and the theoretical basis be- 
hind the different design aspects. Section 3 describes the preliminary ex- 
perimental study that is carried out to prove the merit of the proposed 
model. Section 4 summarizes the paper’s conclusions and outlines some 
future work. 

2 The Proposed Modular Network 

First, an unsupervised network is used for task decomposition, i.e. 
subgroups of classes are assigned to small modules rather than classifying 
all classes using one large non-modular network. During learning, each 
module is trained to classify its own group of classes. Beyond 
this simple divide-and-conquer idea, every module is also trained using 
samples representing the other modules (groups). The output layer of a 
module therefore consists of “class outputs” (C_i), equal in number 
to the classes in the group, plus “group outputs” (O_ij), equal in number to 
the “other” modules, i.e. number-of-modules - 1 (Fig. 1). 

If the training sample is in one of the module’s classes, its output bit 
is high (C_i = 1.0) while all other outputs, pointing to other classes in the 
group and to other groups, are low (0.0). Otherwise, the output of module i 
should point to the bit representing one other module j (O_ij = 1.0), while 
the others are low (0.0). 
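The target encoding described above can be made concrete with a small sketch. The function name, the list-of-groups representation and the three-group example are hypothetical; the sketch also assumes that the "one other module j" is the module whose group contains the sample's class.

```python
def module_target(module_idx, groups, sample_class):
    """Build the training target vector for one module.

    groups: list of class-label lists, one per module.
    Returns class outputs (one per class in the module's own group)
    followed by group outputs (one per other module).
    """
    own_classes = groups[module_idx]
    other_modules = [j for j in range(len(groups)) if j != module_idx]
    class_outputs = [1.0 if c == sample_class else 0.0 for c in own_classes]
    # Assumption: the group output of the module that owns the class is high.
    group_outputs = [1.0 if sample_class in groups[j] else 0.0
                     for j in other_modules]
    return class_outputs + group_outputs

groups = [[1, 3, 4], [2], [5]]           # hypothetical decomposition
print(module_target(0, groups, 3))       # sample belongs to module 0
print(module_target(0, groups, 5))       # sample belongs to module 2
```

Each target vector contains exactly one high bit, so a module always learns either which of its own classes the sample belongs to or which other module should handle it.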

During testing, every module inhibits or enhances the responses of the 
other modules by voting for the existence of the input within their decision 
boundaries. Group outputs (votes) approach 1.0 or 0.0 according to how 
near or far, respectively, the sample is from the corresponding class in 
the feature space. This “cooperation” in taking the decision takes place 
above the modules’ output layer in the voting block (Fig. 1). Multiple 
neural network modules cooperating in taking a classification decision 
are modeled as multiple voters electing one candidate in a single ballot 
election. All modules are considered “candidates” and “voters.” Voting 
“bids” are the different group outputs, and the highest for each module 
is considered the module’s vote. Plurality (Majority) voting is the most 
common voting scheme in real-life collective decision-making processes 
[8, 10]. Each voter votes for one alternative, and the alternative with the 
largest number of votes wins. 

Fig. 1. An example of the proposed modular neural network (layers shown, 
from bottom to top: hidden layers, output layers, voting layer, final 
classification decision). 

The advantage of this scheme, from the NN perspective, is that it only 
uses the highest output value, which is the output most likely to be 
true, even if its value is well below “1.0”. Note that the probability of 
correctness of a certain class is not proportional to the corresponding 
output value. Empirically, according to [1] and also [6], we noticed that 
the probability of having one of the lower outputs as the correct output is 
very low, unless the NN needs more training. Therefore, we consider the 
lower outputs as “information noise”, and rely only on the highest value 
for the module’s vote. For a comparison of the different voting schemes 
applied to NN-classifier decision-making, refer to [3]. 

Therefore, the decision-making process can be summarized as follows. 

1. The voting strategy determines the group/module that is more likely 
to contain the tested sample within its decision boundaries. 

2. If the result of the majority vote is a “tie”, then the sample is classified 
as a vague class (or boundary) between the classes (two or more) that 
have the tie. 

3. If the result of the majority vote clearly defines a winner-module, 
the maximum class-output within this module is taken as the final 
decision. 
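The three steps above can be sketched as a small voting routine. Several details are assumptions made for this illustration and are not stated in the text: a module is taken to vote for itself when its largest class output is at least as high as its largest group output, at least two modules are assumed, and a tie is reported as the list of the tied modules' best classes.

```python
from collections import Counter

def classify(outputs, groups):
    """Combine module outputs by plurality voting.

    outputs[i] = (class_outputs, group_outputs) for module i, where
    group_outputs maps the index of each other module to an activation.
    groups[i] lists the class labels handled by module i.
    Returns ("class", label) or ("vague", [labels]) on a tie.
    """
    votes = Counter()
    for i, (class_out, group_out) in enumerate(outputs):
        best_other = max(group_out, key=group_out.get)
        # Assumption: a module votes for itself if its own class evidence wins.
        if max(class_out) >= group_out[best_other]:
            votes[i] += 1
        else:
            votes[best_other] += 1

    top = max(votes.values())
    winners = sorted(m for m, v in votes.items() if v == top)

    def best_class(m):
        class_out = outputs[m][0]
        return groups[m][class_out.index(max(class_out))]

    if len(winners) > 1:                       # step 2: tie -> vague class
        return ("vague", [best_class(m) for m in winners])
    return ("class", best_class(winners[0]))   # step 3: clear winner

groups = [[1, 3, 4], [2], [5]]
outputs = [
    ([0.1, 0.8, 0.2], {1: 0.1, 2: 0.05}),   # module 0 votes for itself
    ([0.2], {0: 0.7, 2: 0.1}),              # module 1 votes for module 0
    ([0.1], {0: 0.6, 1: 0.3}),              # module 2 votes for module 0
]
print(classify(outputs, groups))
```

Returning the tied classes rather than picking one at random is what turns a likely misclassification into an explicit boundary decision.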

Without allowing vague/boundary classes, the only way to take a 
classification decision in case of a tie vote is a random choice between the 
winner classes. This will, naturally, cause a lot of classification errors. For 
example, a random class choice is likely to give a 50% error in case of a 
boundary between two classes. This is a drawback in all modular designs 
that use the majority vote as a multi-module decision-making strategy, 
for example, [3] and [4]. Therefore, the vague classification decision will 
help reduce misclassifications. Another advantage of this decision-making 
process is that it makes the classification system more realistic and 
practical, as outlined in Section 1. 

3 Experiments 

We created a two-feature/attribute classification problem to use as a 
benchmark. A two-dimensional input space gives an illustration of the 
shapes of the decision boundaries. Samples are randomly generated from 
Gaussian distributions around 20 random class means (Fig. 2). Each class 
consists of 200 samples, half for training and half for testing. ART2 and 
Backpropagation (BP) schemes are used for the unsupervised and super- 
vised networks, respectively. 

The modular network is compared to the non-modular network (2 
inputs and 20 outputs) classifying the same data. To guarantee a fair 
comparison, the BP learning parameters, the algorithm for terminating 
training, and the criteria for determining the number of hidden nodes 
are unified across the modular and the non-modular neural networks. 
The unsupervised task-decomposition technique clustered the data into 
12 groups, namely, 1-3-4, 2, 5, 6-7, 8, 9, 10, 11-12-14, 13, 15-19-20, 16-17, 
and 18. All modules are trained in parallel, and they all use an equal 
number of training samples per class, or group, output. 

Table 1 summarizes the results. The proposed modular structure reduced 
the number of misclassified samples of the non-modular neural network by 
73.3% (from 552 to 147), i.e., it raised the correct classification rate from 
72.4% to 92.65%. Given the accuracy of the voting scheme in identifying 
modules (groups), and the efficiency of neural modules dealing with smaller 
and more homogeneous sub-tasks, decision boundaries are drawn much more 
accurately than by the non-modular network that uses the same supervised 
learning scheme. 

Fig. 2. The 20-class 2-dimensional classification data. 

Network                                 # Miscl. samples    Correct cl. % 
Non-modular BP                          552                 72.4% 
Modular network                         147                 92.65% 
Modular network (with vague classes)    98                  95.1% 

Table 1. A summary of the results. 
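The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch assumes 2000 test samples in total, which follows from 20 classes with 100 test samples each as described in the experimental setup; the function names are illustrative.

```python
# 20 classes, 200 samples each, half of them used for testing.
TEST_SAMPLES = 20 * 100

def correct_rate(misclassified, total=TEST_SAMPLES):
    """Percentage of correctly classified test samples."""
    return 100.0 * (total - misclassified) / total

def error_reduction(before, after):
    """Percentage of the original errors that were removed."""
    return 100.0 * (before - after) / before

for name, errors in [("Non-modular BP", 552),
                     ("Modular network", 147),
                     ("Modular network (with vague classes)", 98)]:
    print(f"{name}: {correct_rate(errors):.2f}% correct")

print(f"Error reduction, modular vs. non-modular: {error_reduction(552, 147):.1f}%")
```

The computed rates reproduce the table (72.4%, 92.65% and 95.1%), and the relative error reduction comes out at about 73.4%, which the text reports as 73.3%.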

About 36% of the samples caused a tie between 2 or more of the 
voting modules. Without allowing a vague/boundary decision, the 
classification outcome would depend on a random choice for the output class. 
This caused a percentage of errors that were eliminated by the vague-class 
decision. Hence, the percentage of certain/lower approximation 
classification increased to 95.1% for this benchmark. 
4 Conclusion 

A modular neural network classifier design is presented. The objective 
behind the design is to enhance the classification performance of 
conventional neural classifiers according to two criteria, namely, reducing 
the classification error, and allowing vague/boundary classification 
decisions. The proposed classifier is tested using a two-dimensional illustrative 
benchmark classification problem. Results show an enhancement in 
the classification performance. The new model’s success is due to its 
multi-stage task decomposition, and to utilizing all modules’ information through 
cooperation and voting. Future versions of the model will integrate more 
rough set concepts in the design. The system will select specific attributes 
for each module that best differentiate between its classes (similar to [9]). 
This is believed to increase the modules’ ability for accurate classification 
and to further enhance the performance of the neural classifier. 

References 

1. G. Auda. Cooperative modular neural network classifiers. PhD thesis, University 
of Waterloo, 1996. 

2. G. Auda, M. Kamel, and H. Raafat. A new neural network structure with 
cooperative modules. In World Congress on Computational Intelligence, volume 3, 
pages 1301-1306, Florida, USA, June 1994. 

3. G. Auda, M. Kamel, and H. Raafat. Voting schemes for cooperative neural 
network classifiers. In IEEE International Conference on Neural Networks, ICNN'95, 
volume 3, pages 1240-1243, Perth, Australia, November 1995. 

4. R. Battiti and A. Colla. Democracy in neural nets: Voting schemes for 
classification. Neural Networks, 7(4):691-707, 1994. 

5. W. Ziarko (ed.). Rough sets, fuzzy sets and knowledge discovery. Springer-Verlag, 
1993. 

6. G. Goetsch. Maximization of mutual information in a context sensitive neural 
network. Technical Report CMU-CS-90-168, Carnegie Mellon University, Sept. 
1990. 

7. R. Jacobs, M. Jordan, and A. Barto. Task decomposition through competition in 
a modular connectionist architecture: The what and where vision tasks. Neural 
Computation, 3:79-87, 1991. 

8. H. Nurmi. Comparing voting systems. D. Reidel Publishing Company, 1987. 

9. Z. Pawlak. Rough sets: Theoretical aspects of reasoning about data. Kluwer 
Academic Publishers, 1991. 

10. P. Straffin. Topics on the theory of voting. The UMAP Expository Monograph 
Series, Birkhauser, 1980. 




Evolutionary Parsing for a Probabilistic Context Free Grammar 



L. Araujo 

Dpto. Sistemas Informaticos y Programacion. Universidad Complutense de Madrid. 

Spain, lurdes@sip.ucm.es 



Abstract. Classic parsing methods are based on complete search 
techniques to find the different interpretations of a sentence. However, the 
size of the search space increases exponentially with the length of the 
sentence or text to be parsed, so that exhaustive search methods can fail 
to reach a solution in a reasonable time. Nevertheless, large problems can 
be solved approximately by some kind of stochastic techniques, which do 
not guarantee the optimum value, but allow adjusting the probability of 
error by increasing the number of points explored. Genetic Algorithms 
are among such techniques. This paper describes a probabilistic natural 
language parser based on a genetic algorithm. The algorithm works with 
a population of possible parsings for a given sentence and grammar, 
which represent the chromosomes. The algorithm produces successive 
generations of individuals, computing their “fitness” at each step and 
selecting the best of them when the termination condition is reached. The 
paper deals with the main issues arising in the algorithm: chromosome 
representation and evaluation, selection and replacement strategies, and 
design of genetic operators for crossover and mutation. The model has 
been implemented, and the results obtained for a number of sentences 
are presented. 

Keywords: Evolutionary programming, Parsing, Probabilistic Grammar 

1 Introduction 

Classic parsing methods are based on complete search techniques to find the 
different interpretations of a sentence. However, experiments on human parsing 
suggest that people do not perform a complete search of the grammar while 
parsing. On the contrary, human parsing seems to be closer to a heuristic 
process with some random component. This suggests exploring alternative search 
methods in order to improve efficiency. Another central point when parsing 
is the need to select the “most” correct parsing from the multitude of 
possible parsings consistent with the grammar. In such a situation, some kind of 
disambiguation is required. Statistical parsing helps to tackle both 
questions, that is, it avoids an exhaustive search and provides a way of dealing with 
disambiguation. 

Stochastic grammars [1], obtained by supplementing the elements of 
algebraic grammars with probabilities, represent an important part of the statistical 
methods in computational linguistics. They have allowed important advances in 
areas such as disambiguation and error correction. Genetic algorithms (GAs) are 
another family of stochastic methods. They have already been applied to different 
issues of natural language processing. Davis and Dunning [3] use them for query 
translation in a multi-lingual information retrieval system. GAs have also been 
applied to the inference of context-free grammars [2]. Wyard [6] devised a genetic 
algorithm for the language of correctly balanced nested parentheses, while Smith 
and Witten [5] proposed a genetic algorithm for the induction of non-recursive 
s-expressions. 

This paper presents a stochastic parser based on a genetic algorithm which 
works with a population of possible parsings. The algorithm produces successive 
generations of individuals, computing their “fitness” at each step and selecting 
the best of them when the termination condition arises. Apart from the char- 
acteristic efficiency of these stochastic methods, the nature of the generation of 
solutions in a genetic algorithm brings the advantages of statistical approaches. 

The rest of the paper proceeds as follows: Section 2 describes the evolutionary 
parser, presenting the main elements of the genetic algorithm; section 3 presents 
and discusses the experimental results, and section 4 draws the main conclusions 
of this work. 



2 Evolutionary Parsing 

Determining the syntactic structure of a sentence is a necessary step before 
determining its meaning. Such structures assign a syntactic category (verb, noun, 
etc.) to each word in the sentence and specify how these categories are clustered to 
form higher-level categories (np, vp, etc.) until the whole sentence is built. The 
grammar specifies the permitted structures in a language. Context-free 
grammars (CFGs), whose rules present a single symbol on the left-hand side, are 
a sufficiently powerful formalism to describe most of the structure in natural 
language, while at the same time being sufficiently restricted to allow efficient 
parsing. 

Parsing according to a grammar amounts to assigning one or more structures 
to a given sentence of the language the grammar defines. If there are sentences 
with more than one structure, as in natural language, the grammar is ambiguous. 
Parsing can be seen as a search process that looks for correct structures for 
the input sentence. Moreover, if we can establish some kind of preference among 
the correct structures, the process can be regarded as an optimization 
one. This suggests considering evolutionary programming techniques, which are 
acknowledged to be practical search and optimization methods [4]. 

Probabilistic grammars [1] offer a way to establish preferences between 
parsings. In a probabilistic CFG a weight is assigned to each rule in the grammar. 
The probability of each parsing is the product of the probabilities of all the 
rules used in the parsing. Probabilistic grammars not only offer a way to deal 
with issues such as ambiguity or ungrammaticality [1], but can also lead to an 
improvement in performance. Genetic algorithms and probabilistic grammars 
complement each other, for at least two reasons: 
a) Large populations in a GA lead to a higher diversity at the expense of slowing 
down the convergence process, while higher percentages in the application 
of genetic operators hasten the process but increase the selective pressure. 
The use of probabilistic grammars helps to accelerate the convergence 
process. Although the selective pressure is increased for individuals composed 
of grammar rules of high probability, this will lead to better individuals for 
most sentences (since they will correspond to the most probable rules). Thus, 
in general there will not be a premature convergence to a wrong individual. 

b) The nature of GAs, which favours the exploration of new areas of the 
search space, helps to reach a correct result, even if the sentence to parse 
requires applying rules of low probability. 

Following these considerations, a probabilistic GA has been designed, in which
the parsings that compose the population correspond to a probabilistic CFG.
When the algorithm finishes with correct parsings, the one whose genes have the
largest product of probabilities is chosen. This is the algorithm's answer for
the most probable parsing of the sentence.
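
The scoring rule described above can be sketched in a few lines of Python. This
is only an illustration: the rule set and the probabilities in RULE_PROB are
made up for the example, and the function names are not taken from the original
C implementation.

```python
from math import prod

# Hypothetical rule probabilities for a toy probabilistic CFG
# (illustrative values only, not taken from the paper).
RULE_PROB = {
    "S -> NP VP": 1.0,
    "NP -> Det Noun": 0.6,
    "NP -> Noun": 0.4,
    "VP -> Verb NP": 0.7,
    "VP -> Verb": 0.3,
}

def parse_probability(rules_used):
    """Probability of a parsing: product of the probabilities of its rules."""
    return prod(RULE_PROB[rule] for rule in rules_used)

def most_probable(parsings):
    """Among the correct parsings, return the one with the largest probability."""
    return max(parsings, key=parse_probability)
```

Each parsing is represented here simply as the list of rules it uses; a rule
applied twice appears twice in the list and contributes twice to the product.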



2.1 Chromosome Representation 

In our system, chromosomes represent parsings of the input sentence according
to a fixed context-free grammar. The input sentence is given as a sequence of
words, each with its set of categories attached (if a word belongs to several
categories, all of them are listed). This information could easily be obtained
from a lexicon in a preprocessing step. Let us consider a simple example. The
sentence “the man sings a song” will be given as the(Det) man(Noun)
sings(Verb) a(Det) song(Noun).

A chromosome is represented as a data structure containing the following
information:

- Fitness of the chromosome.

- A list of genes, which represents the parsing of different sets of words in
the sentence.

- The number of genes in the chromosome.

- The depth of the parsing tree.

Each gene represents the parsing of a consecutive set of words in the sentence.
If this parsing involves non-terminal symbols, the parsing of the subsequent
partitions of the set of words is given in later genes. Accordingly, the
information contained in a gene is the following:

- The sequence of words in the sentence to be analyzed by the gene. It is
represented by two values: the position in the sentence of the first word in
the sequence, and the number of words in the sequence.

- The rule of the grammar used to parse the words in the gene.

- If the right hand side of the rule contains non-terminal symbols, the gene
also stores the list of references to the genes corresponding to the parsing of
these symbols.

- The depth of the node corresponding to the gene in the parsing tree. It will
be used in the evaluation function.



Fig. 1. Possible chromosomes (Chromosome 1 and Chromosome 2) for the sentence
The man sings a song. NP stands for nominal phrase, VP for verb phrase, Det for
determiner, Adj for adjective, PP for prepositional phrase and AP for adjective
phrase.


Figure 1 presents some possible chromosomes for the sentence of the example. 
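
One way to picture the chromosome and gene records listed above is as two small
Python data classes. This is a sketch only: the field names are chosen for the
example and are not those of the original implementation, and references
between genes are assumed to be indices into the chromosome's gene list.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Gene:
    first_word: int   # position in the sentence of the first word covered
    num_words: int    # number of consecutive words covered
    rule: str         # grammar rule used, e.g. "NP -> Det NP"
    # Indices (assumed representation) of the genes that parse the
    # non-terminal symbols on the right hand side of the rule.
    children: List[int] = field(default_factory=list)
    depth: int = 0    # depth of the corresponding node in the parsing tree

@dataclass
class Chromosome:
    genes: List[Gene] = field(default_factory=list)
    fitness: float = 0.0
    depth: int = 0    # depth of the parsing tree

    @property
    def num_genes(self) -> int:
        """Number of genes in the chromosome."""
        return len(self.genes)
```

For instance, the NP part "the man" parsed with NP -> Det NP followed by
NP -> Noun would occupy two genes, the first one referring to the second.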



Initial Population The initial population consists of PS individuals generated
at random, according to the probabilities of the different rules. The steps for
the creation of chromosomes in the initial population are the following:

- The set of words in the sentence is randomly partitioned, making sure that
there is at least one verb in the second part, which corresponds to the main
VP.

- The set of words corresponding to the NP is parsed by randomly choosing
(consistently with the assigned probabilities) any of the possible NP rules.
The same is done for generating the parsing of the VP with the VP rules.
The process is improved by enforcing the application of those rules able to
parse the right number of words of the gene.

- If the rules applied contain non-terminal symbols in their right hand sides,
the parsing process is applied to the sets of words which have not yet been
assigned a category.

- The process continues until no non-terminal symbols are left pending to be
parsed.
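
The first step above, the random partition of the sentence into a main NP part
and a main VP part, might look as follows in Python. This is a sketch under
stated assumptions: the split point is drawn uniformly, the NP part is required
to be non-empty, and the function name random_split is invented for the example.

```python
import random

def random_split(categories, rng=random):
    """Pick a split index k so that words[:k] is the main NP part and
    words[k:] is the main VP part and contains at least one verb.

    categories: one set of category names per word of the sentence.
    rng: any object with a randint(a, b) method (the random module by default).
    """
    verb_positions = [i for i, cats in enumerate(categories) if "Verb" in cats]
    if not verb_positions or verb_positions[-1] == 0:
        raise ValueError("no split leaves a non-empty NP part and a verb in the VP part")
    # Any k in 1..last verb position keeps the last verb inside words[k:].
    return rng.randint(1, verb_positions[-1])
```

For "the man sings a song" the only verb is at position 2, so the split index
is either 1 or 2; later steps then decide whether the resulting parts can be
parsed.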

2.2 Fitness: Chromosome Evaluation



The adaptation of individuals is re-evaluated after each new generation by
testing the ability of every chromosome to parse the target sentence. The
evaluation of individuals is a crucial point in evolutionary algorithms, since
an individual's chances of survival depend on its fitness.

Fitness is computed as

$$
\text{fitness} =
\frac{\text{number of coherent genes} \;-\;
      \sum_{g \,\in\, \text{incoherent genes}}
      \dfrac{\text{penalization}}{\text{depth}(g)}}
     {\text{total number of genes}}
$$



This formula is based on the relative number of coherent genes. A gene will be 
considered coherent if 



a) it corresponds to a rule whose right hand side is composed only of terminal
symbols, and these correspond to the categories of the words to be parsed by
the rule.

b) it corresponds to a rule with non-terminal symbols in its right hand side and 
each of them is parsed by a coherent gene. 

The formula takes into account the relative relevance of the genes: the higher
in the parsing tree the node corresponding to an incoherent gene is, the worse
the parsing. Thus the fitness formula includes a penalization factor which
decreases with the depth of the gene.
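
A minimal Python sketch of this evaluation is given below. It assumes the
fitness formula reads as the number of coherent genes, minus a penalty of
penalization/depth for every incoherent gene, divided by the total number of
genes. It further assumes that depths start at 1 for the main NP and VP and
that the coherence of each gene has already been determined; the default value
of the penalization constant is arbitrary.

```python
def fitness(genes, penalization=1.0):
    """genes: list of (is_coherent, depth) pairs, with depth >= 1.

    Incoherent genes close to the root (small depth) are penalized more
    heavily than incoherent genes deep in the parsing tree.
    """
    total = len(genes)
    coherent = sum(1 for ok, _ in genes if ok)
    penalty = sum(penalization / depth for ok, depth in genes if not ok)
    return (coherent - penalty) / total
```

With these assumptions a fully coherent chromosome scores 1, and the score can
become negative when incoherent genes near the root dominate.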



2.3 Genetic Operators 

Chromosomes in the population of subsequent generations which did not appear
in the previous one are created by means of two genetic operators: crossover
and mutation. The crossover operator combines two parsings to generate a new
one; mutation creates a new parsing by replacing a randomly selected gene in an
existing chromosome. The rates of crossover and mutation applied at each step
are input parameters, and the efficiency of parsing is very sensitive to them.
At each generation a number of chromosomes equal to the number of offspring is
selected to be replaced. The selection is performed with respect to the
relative fitness of the individuals: a chromosome with worse than average
fitness has a higher chance of being selected for replacement, whereas
chromosomes with above-average fitness have a higher probability of being
selected for reproduction.



Reproduction The crossover operator generates a new chromosome that is added
to the population in the new generation. The part of one parent after a
randomly selected point is exchanged with the corresponding part of the other
parent to produce two offspring, under the constraint that the genes exchanged
correspond to the same type of parsing symbol (NP, VP, etc.) in order to avoid
wrong references from previous genes in the chromosome. Of course, exchanges
which produce parsings inconsistent with the number of words in the sentence
must be avoided. Therefore, the crossover operation performs the following
steps:




Evolutionary Parsing for a Probabilistic Context Free Grammar 595 

- Select two parent chromosomes, C1 and C2.

- Randomly select a word from the input sentence.

- Identify the innermost gene to which the selected word corresponds in each
parent chromosome.

- If the genes correspond to different sets of words, the next gene in
innermost order is selected. This process continues until the sequences of
words whose parsings are to be exchanged are the same, or until the main NP or
VP is reached.

- If the two selected genes parse the same sequence of words, the exchange is
performed.

- If the process of selecting genes leads to the main NP or VP and the
sequences of words still do not match, the exchange cannot be performed. In
this case a different procedure is followed: in each parent one of the two
halves is maintained while the other one is randomly generated to produce a
parsing consistent with the number of words of the sentence. This produces four
offspring, out of which the best is selected.

- Finally, the offspring chromosome is added to the population.



Mutation Selection for mutation is done in inverse proportion to the fitness
of a chromosome. The mutation operation changes the parsing of some randomly
chosen sequence of words by performing the following steps:

- A gene is randomly chosen from the chromosome.

- The parsing of the selected gene, as well as that of every gene
corresponding to its decomposition, is erased.

- A new parsing is generated for the selected gene.



3 Experimental Results 

The algorithm has been implemented in C and run on a Pentium II processor. In
order to evaluate its performance we have considered the parsing of the
sentences appearing in Table 1. The average length of the sentences is around
10 words. However, they present different complexities for parsing, mainly due
to their length and number of subordinate phrases.



1. Jack(noun) regretted(verb) that(wh) he(pro) ate(verb) the(det) whole(adj)
thing(noun)

2. The(det) man(noun) who(wh) gave(verb) Bill(noun) the(det) money(noun)
drives(verb) a(det) big(adj) car(noun)

3. The(det) man(noun) who(wh) lives(verb) in(prep) the(det) red(adj)
house(noun) saw(verb) the(det) thieves(noun) in(prep) the(det) bank(noun)



Table 1. Sentences used in the parsing experiments.

The results reported in this section have been obtained as the average of
five runs with different seeds. Results show that in most cases the correct
parsing is reached in a small number of steps: fewer than 10 for populations of
more than 300 individuals.

Results obtained with a deterministic context-free grammar are compared with
those obtained using a probabilistic grammar. Figure 2 shows the results
obtained for the three sentences when using a crossover rate of 50% and a
mutation rate of 20%. Results clearly improve in all cases when the
probabilistic grammar is used. The first observation is that while the
deterministic CFG produces irregular convergence processes, the probabilistic
one leads to highly regular processes as the population size grows. This
indicates a higher robustness of the genetic algorithm. The difference between
the results from the two kinds of grammar increases with the complexity of the
sentence. Thus, while the deterministic CFG leads to quick convergence for
sentence 1, the process is quite irregular for sentences 2 and 3. Another
observation is that a threshold population size is required to achieve
convergence.




Fig. 2. Number of iterations required to reach the correct parsing, as a
function of population size, with a probabilistic grammar (P) and a
deterministic grammar (D).



The most relevant GA parameters have been studied (data not shown). It is
clear that the population diversity and the selection pressure are related to
the population size. If the population size is too small, the genetic algorithm
will converge too quickly to a bad result (all individuals correspond to
similar incorrect parsings), but if it is too large the GA will take too long
to converge. Results show that the behavior is quite different for each
sentence: the higher the “sentence complexity”, the larger the population size
required to reach the correct parsing in a reasonable number of steps. The
sentence complexity depends on its length and on the number of subordinate
phrases it contains. Besides, as the population size increases, higher rates of
crossover and mutation are required to increase the efficiency of the
algorithm.

4 Conclusions 

This paper presents a genetic algorithm that adapts a population of possible
parsings for a given input sentence and a given grammar. Genetic algorithms
allow a statistical treatment of the parsing process, providing at the same time
the typical efficiency of stochastic methods.

Results from a number of tests indicate that the GA is a robust approach for
parsing positive examples of natural language. A number of issues of the GA have
been tackled, such as the design of the genetic operators and a study of the GA
parameters. The tests indicate that the GA parameters need to be suited to the
complexity of the input sentence. The more complex the sentence (in length and
number of subordinate phrases), the larger the population size required to
quickly reach a correct parsing.

Probabilistic grammars and genetic algorithms have been shown to complement
each other. The use of a probabilistic context-free grammar instead of a
deterministic one for the generation of the population of parsings in the
algorithm has been investigated. Results obtained for these experiments show a
clear improvement in performance. For short sentences, though, a greedy parsing
algorithm can be at least as fast. Nevertheless, the method proposed herein
also allows dealing with problems such as ambiguity or ungrammaticality, and is
expected to be advantageous for parsing long texts. Work along this line is
currently in progress.

Abstract. Milk yield forecasting can help dairy farmers to deal with
continuously changing conditions all year round and to reduce unnecessary
overheads. Several variables (somatic cell count, parity, days in milk, milk
protein content, milk fat content, season) related to milk yield are collected
as the parameters of the forecasting model. The use of an improved Genetic
Programming (GP) technique with dynamic learning operators is proposed, and
acceptable prediction results are achieved.

Keywords: Genetic programming, dynamic mutation, milk yield prediction



1 Introduction 

In Taiwan, milk consumption has become more popular over the last few decades,
owing to promotion by the government and to growing consumer awareness of
better nutrition. Because the importance of predicting milk yield has been
neglected, dairy farmers usually cannot deal well with continuously changing
conditions and incur high uncontrollable overheads in the production system.
Many research works have tried to solve this problem by using traditional
approaches such as regression and time series analysis, but these are
restricted by missing or incomplete data and may not generate sufficiently
accurate milk yield predictions. According to previous studies (Dun, 1980; Wu,
1989; Tseng, 1992; Hu, 1994; Mo, 1996), the milk yield forecasting problem is
complex, nonlinear, and continuous. Therefore, mathematical models developed by
statistical methods may be difficult or complicated to build, and they lack the
learning and adaptation capabilities needed to recognize the behavior of the
data set.

Genetic Algorithms (GA) are heuristic search and optimization techniques
rooted in the principles of natural evolution (Holland, 1975). As an extension
of the GA paradigm, Genetic Programming (GP) is able to automatically construct
computer programs by means of the Darwinian theory of natural selection (Koza,
1982). GP does not require the problem to be encoded as a string over a finite
alphabet, nor does it require any assumption about the functional relationship
between independent and dependent variables. This technique has been widely
applied to classification,



W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 598-602, 2001. 
© Springer-Verlag Berlin Heidelberg 2001 




The Application of Genetic Programming in Milk Yield Prediction for Dairy Cows 599 



forecasting, and model construction problems (Dworman et al, 1996; Lee et al, 1997; 
Jan, 1998). This research adopts an improved dynamic learning mechanism in GP that 
is able to construct a forecasting formula more efficiently (Chiu, 1999). 



2 Basic Principles of GP 

GP involves creating a mathematical or logical expression, in symbolic form,
that provides a good, best, or perfect fit between a given finite sample of
values of the independent variables and the associated values of the dependent
variables. In other words, GP involves creating a model that fits a given
sample of data. When the variables are real-valued, GP finds both the
functional form and the numeric coefficients of the model. GP thus differs from
conventional linear, quadratic, or polynomial regression, which merely finds
the numeric coefficients of a function whose form is fixed in advance. GP
expresses the model as a tree structure, which allows mathematical, logical, or
functional structures to be combined within one expression tree. The tree
follows the so-called “Polish notation”, which puts the operator (the
functional form) first and then its operands (the variables and numeric
coefficients). Thus, for example, the simple mathematical function y = 5*x+3
can be represented as y = +(*(5,x), 3) or as the tree structure shown in
Figure 1. In applying GP to a problem, the user must define the operator set
and the operand set whose elements are used as nodes in a tree. The operators
are used to generate the mathematical expression that attempts to fit the given
finite sample of data. The set of operands, together with the set of operators,
provides the ingredients from which GP attempts to construct a model that
solves, or approximately solves, the problem.
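
The prefix representation of y = 5*x+3 can be made concrete with a short Python
sketch. The encoding of a tree as nested tuples (operator, left, right) and the
function names are assumptions made for this example, not part of the GP system
used in the paper.

```python
import operator

# Operator set for the toy example (binary arithmetic operators only).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(node, env):
    """Evaluate an expression tree; env maps variable names to values."""
    if isinstance(node, tuple):          # internal node: (operator, left, right)
        op, left, right = node
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(node, str):            # leaf: variable
        return env[node]
    return node                          # leaf: numeric constant

def to_prefix(node):
    """Render the tree in prefix (Polish) notation."""
    if isinstance(node, tuple):
        op, left, right = node
        return f"{op}({to_prefix(left)}, {to_prefix(right)})"
    return str(node)

# y = 5*x + 3 as an expression tree: operator first, then its operands.
tree = ("+", ("*", 5, "x"), 3)
```

Here to_prefix(tree) gives "+(*(5, x), 3)" and evaluate(tree, {"x": 2}) gives
13. Crossover and mutation in GP act on sub-trees of exactly this kind of
structure.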




Because of the tree-based representation, the three main genetic operators,
selection, crossover and mutation, work differently from their counterparts in
GA. In the selection operation, pairs of parent trees are selected for
reproduction on the basis of their fitness. The most usual selection mechanisms
are Fitness Proportionate Reproduction and Tournament Selection. In the
crossover operation, two parent trees are sexually combined to form two new
offspring: the parents are picked at random and parts of their trees are
exchanged. Since different mutation operators may have different impacts on
evolution performance, this paper proposes a mixed approach that combines the
advantages of multiple mutation operators (Chiu, 1999).

3 The Proposed Methodology

The first step of the mutation operation is to determine the mutation node.
Depending upon the problem characteristics, the operation may apply to the
following three types of mutation subjects: a single node, multiple nodes, or a
sub-tree in programs, under certain constraints of evolutionary consideration.
Assume O1, O2, and O3 are defined as the three mutation operators that apply to
the three types of mutation subjects respectively. The basic principle of this
approach is that an operator that yields better offspring fitness values during
evolution is encouraged by increasing its mutation rate. Conversely, if an
operator does not produce better results, its mutation rate is decreased. The
mutation rate designated to each operator is thus dynamically re-allotted
according to its individual performance. Let r be the pre-defined mutation
rate, so that each operator is initially assigned the mutation rate r/3. A
simpler form of crossover, using the same crossover point in both program
parents, is adopted here. Right after the mutation operation, each mutation
operator is evaluated and re-ranked according to its individual performance.
Among the three operators, the mutation rate of the best one is increased by w%
of r/3, that of the second remains at its present rate, and that of the worst
one is decreased by w% of r/3. Through these adjustment processes, the mutation
rates are updated dynamically. The detailed algorithm can be found in (Chiu,
1999).
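
The rate adjustment described above can be sketched in Python as follows. The
performance measure passed in (higher is better) is left abstract, since the
text refers to (Chiu, 1999) for the details; clamping the rate at zero is an
extra safeguard assumed for this example rather than something stated in the
paper.

```python
def update_mutation_rates(rates, performance, r, w):
    """Re-allot mutation rates among the mutation operators.

    rates:       dict mapping operator name -> current mutation rate
    performance: dict mapping operator name -> score (higher is better)
    r:           pre-defined overall mutation rate
    w:           adjustment percentage
    """
    step = (w / 100.0) * (r / 3.0)       # w% of r/3
    ranked = sorted(rates, key=lambda op: performance[op], reverse=True)
    best, worst = ranked[0], ranked[-1]
    new_rates = dict(rates)
    new_rates[best] += step              # best operator is encouraged
    # Assumed safeguard: a rate never drops below zero.
    new_rates[worst] = max(0.0, new_rates[worst] - step)
    return new_rates                     # middle operator keeps its rate
```

As long as no rate hits the zero clamp, the three rates still sum to r after
each update, so the overall mutation pressure stays constant while its
distribution over the operators shifts.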



4 The Experiments and Results 

Milk production data along with physiological data of cows were obtained from
the Experimental Farm of the College of Agriculture, National Taiwan
University. According to previous studies, somatic cell count, parity, days in
milk, milk protein content, milk fat content, and month of the year are related
to milk yield (Mo, 1996; Wu, 1989). Data consisting of the above factors were
therefore collected every month from 30 Holstein cows of the same species. The
training data comprise 643 records collected between August 1994 and July 1996,
and the testing data comprise 532 records collected from September 1996 to
September 1998. Data are missing for August and November 1996 and for April,
May, June, July, August, September, and November 1997. We pre-process the month
variable by clustering the 12 months into three groups according to the climate
characteristics of each month: December, January, February, and March as 1;
April, May, June, October, and November as 2; and July, August, and September
as 3. Since GP searches for the model randomly, the operator set, the operand
set, and the parameters for the training process have to be determined in
advance. The operator set includes +, -, *, and / (with division protected
against zero denominators). The parameters for the training process are defined
as follows: Population Size: 150; Crossover Rate: 0.6; Mutation Rate (r): 0.2;
Simplification of Solutions: True; Termination Criterion: 5000 generations. The
computer system used for this process was a single-processor Intel Pentium III
450 MHz system with 128 MB RAM, running Microsoft Windows 98.
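
Two of the pre-processing choices above are easy to state in code. The month
grouping below follows the text directly. The protected division is only a
sketch: the paper says that zero denominators are prevented but not how, so
returning 1.0 in that case (a common GP convention) is an assumption, as are
the names MONTH_GROUP and protected_div.

```python
# Month of the year (1-12) -> climate group, as described in the text.
MONTH_GROUP = {
    12: 1, 1: 1, 2: 1, 3: 1,          # December to March
    4: 2, 5: 2, 6: 2, 10: 2, 11: 2,   # April to June, October, November
    7: 3, 8: 3, 9: 3,                 # July to September
}

def protected_div(a, b, eps=1e-12):
    """Division that never raises on a (near-)zero denominator.

    Assumption: 1.0 is returned when the denominator is (close to) zero.
    """
    return a / b if abs(b) > eps else 1.0
```

Using such a protected operator keeps every randomly generated expression tree
evaluable on every data record.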

In the training process, every created solution has a fitness value used to
evaluate its performance. In this study, the mean error of the best individual
in every generation is used for model evaluation. The learning curve of the
model is approximately flat between the 200th generation and the 750th
generation, which suggests that the model has already converged; however, the
model creates better solutions after the 750th generation. Although GP could
not generate models with zero error, increasing the number of generations may
be a way to obtain a better model. The mean error of the best individual in the
last generation is 4.278. Though the predicted results do not exactly coincide
with the actual data, they conform to the overall trend of the expected data.
The comparative predictive performance of GP and the regression method is shown
in Table 1. We will comment on these results in Section 5.



Table 1. The comparative predicting performance between GP and Regression

Method        Correlation Coefficient    Absolute Mean Error
GP            0.695                      4.676
Regression    0.693                      5.512



The derived formula

The model produced by the GP process for the milk yield is as follows:

(46.9757563391006+((-0.00260077965386069*((X1^1)*(X3^1)))+

((-6.29595132173721*(X4^1))+((-0.0183952688341093*(X6^5))+

((-3.28103058073334E-5*(X3^2))+((-0.115769461379268*(X6^2))+

((-0.00731554801232839*(X3^1))+((-0.00618269438233685*(X4^4))+

((-0.0248432320556369*(X4^3))+((0.00025459980864606*(X1^1))+

((-9.12858920103003E-5*(X2^5))+((-0.00261453182147923*(X5^3))+

((0.0190437641263939*(X6^4))+((5.62962747051406E-7*((X3^2)*(X4^3)))+

(0.0481493551104286*(X5^1))))))))))))))))

where: X1: somatic cell count; X2: parity; X3: days in milk; X4: milk protein
content; X5: milk fat content; X6: month of the year



5 Discussion and Conclusions

In this paper, we proposed a genetic programming-based approach to model the
milk yield prediction problem. By introducing a dynamic mutation operator
method into GP's learning model, acceptable results have been achieved, with
best mean absolute errors of 4.278 for the training process and 4.676 for the
testing process. According to Table 1, both of these errors are smaller than
the 5.512 obtained from the regression model. Furthermore, GP exhibits a
slightly better correlation coefficient than the regression model (0.695 vs.
0.693). It can be seen that the milk yield predicted by the GP model
approximately conforms to the actual data records. Some coefficients in the
complicated derived formula are specified to an accuracy of E-20; future
research could reduce this accuracy in order to produce simpler yet better
formulae, because the time otherwise spent on very small mutations of
coefficients can be saved and more of the search can be directed towards the
structure of the function. Undoubtedly there exist influencing factors other
than those collected in this research (Holmes et al., 1987; Schmidt et al.,
1988; Skidmore et al., 1996; Olori, 1997; Olsson et al., 1998). Efforts to
accumulate more data, as well as to collect data from other sources, in order
to improve prediction accuracy are under way in our dairy farms.

Abstract. The issue of localization of sound sources for videoconferencing is
discussed in the paper. A new algorithm for estimating speaker locations, based
on recurrent neural networks (RNN), is introduced and described. The scheme
of experiments carried out in an acoustically adapted chamber, exploiting the
engineered method, is detailed.



1 Introduction 

Localization of sound sources is a key issue in contemporary tele- and
videoconferencing systems. Such localization considerably influences the
efficiency of source acquisition, since it reduces the influence of other
sources on the chosen one and improves the signal-to-noise ratio and, as a
result, the efficiency of noise reduction and dereverberation algorithms.
Furthermore, sound source tracking makes it feasible to change the video camera
direction automatically during a conference, which improves the technical
support and organization of the conference.

Artificial neural networks are commonly applied in many areas of engineering
and audio signal processing [6] [7], since they are capable of processing
uncertain information. They have already been applied to sound localization
[8]; however, these attempts were based on feed-forward structures [28].
Feed-forward networks do not offer the same capabilities as recurrent ones,
especially in the field of time series modeling [9] or mapping of complex
process dynamics [4] [10] [20] [21]. As mentioned in Section 2, localization
theories explain human perception of directivity on the basis of temporal
relationships. Therefore the question arises whether recurrent neural networks
(RNN) can serve such a task as the localization of sound sources.

In this work the focus is put on a general RNN proposed by Elman [10],
although there are a number of other recurrent architectures. An important
class of RNNs is constituted by the so-called NARX networks (Nonlinear Auto
Regressive with exogenous inputs) [17] [18]. They are reported to be robust, to
converge more readily during training than general RNNs [14], and to be
equivalent to a certain extent to general RNNs [22] [23]. However, due to their
structure (only a single feedback loop between the output and input, with a
number of delay taps in the loop) they seem to be unsuitable for the purposes
of sound localization.

Another issue concerns the choice of training algorithm. Although a number of
various methods for training RNNs have been proposed so far [12] [20],
including second-order methods [11], Conjugate Gradient Learning [4] and even
genetic algorithms [20], the authors have focused on the standard approach
introduced by Williams and Zipser [25].

The reason for such a choice of the structure and the training method is
related to the planned systematic investigation of various methods for sound
source localization and their application to real videoconferencing. From this
point of view, this paper describes preliminary experiments and reflects the
initial stages of long-term research.



2 Sound Localization - Problem Statement 

Despite the great development of science in the field of human perception,
issues related to sound localization are not thoroughly understood, and the
underlying phenomena are still the subject of intense research [2] [13]. The
perception of sound directivity by the human binaural system is based on the
following two principal quantities [13]:

Interaural Level Difference (ILD): the difference of intensities of the
waveforms in the left and right ears.

Interaural Time Difference (ITD): the difference of arrival times of the
relevant waveforms in the two ears, which is equivalent to a phase difference
of the waveforms.

In the field of digital signal processing, sound source localization can be
performed by means of a microphone array, which can be either linear or
non-linear [16] [19]. Under ideal conditions, the signal $x_i(t)$ received from
the i-th microphone in the linear array at the t-th moment of time can be
described as follows:

$$ x_i(t) = a_i \cdot s\bigl(t - (i-1)\cdot\tau\bigr) , \qquad (1) $$

where: $a_i$ - attenuation coefficient for the i-th microphone,

$s(t)$ - source signal,

$\tau$ - time delay between adjoining microphones,

and the estimation of the source location is a deterministic problem.
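
To make the ideal model concrete, the following Python sketch generates the
signals seen by a linear array from a sampled source signal. It assumes that
the inter-microphone delay is a whole number of samples and that the source is
silent before the first sample; microphones are indexed from 0 here, so the
delay of microphone i is i times the delay, matching (i - 1) times the delay in
the 1-based notation of the text. The function name is invented for the
example.

```python
def ideal_array_signals(s, attenuations, tau):
    """Signals received by an ideal linear microphone array.

    s:            list of source samples
    attenuations: one attenuation coefficient per microphone
    tau:          delay between adjoining microphones, in samples (integer)
    Returns x with x[i][t] = attenuations[i] * s[t - i * tau],
    taking the source as zero before its first sample.
    """
    n = len(s)
    signals = []
    for i, a in enumerate(attenuations):
        shift = i * tau
        signals.append(
            [a * s[t - shift] if t - shift >= 0 else 0.0 for t in range(n)]
        )
    return signals
```

Under these ideal conditions the delay between any two channels can be read off
directly, which is why the estimation of the source location is deterministic.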

However, under real conditions various distortions and interfering signals
occur, such as background noise and reverberated signals. Then, the signals
received by a linear microphone array are expressed by the relationships:

$$
\begin{aligned}
x_1(t) &= a_1 \cdot h_1(t) * s(t) + n_1(t) \\
x_2(t) &= a_2 \cdot h_2(t) * s(t - \tau) + n_2(t) \\
&\;\;\vdots \\
x_i(t) &= a_i \cdot h_i(t) * s\bigl(t - (i-1)\cdot\tau\bigr) + n_i(t)
\end{aligned}
\qquad (2)
$$

where: $h_i(t)$ - impulse response of the reverberant channel associated with
the i-th microphone,

$n_i(t)$ - ambient noise received by the i-th microphone,

$*$ - convolution.

These conditions make the problem of sound source localization more complex, 
and therefore a number of various methods have been proposed. Most of them are 
based on estimation of the time delay including cross-correlation techniques [3], 
adaptive filtration [5] or computation of relevant eigenvalue vectors and matrices [1]. 
In turn, in the case of tracking or localization of a number of sources, the Maximum 
Likelihood-based methods are exploited [27]. More details can be found in the abun- 
dant literature on the localization of acoustic sources for multimedia applications [15] 
[16] [19] [24] [26]. 
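
As a rough illustration of the cross-correlation idea mentioned above, the
following Python sketch estimates the delay between two microphone channels as
the lag that maximizes their cross-correlation. It is a toy version under
stated assumptions: both channels have the same length, only non-negative
integer lags are searched, and no weighting or interpolation is applied; it is
not the method of any of the cited works.

```python
def estimate_delay(x_ref, x_delayed, max_lag):
    """Return the lag (in samples, 0..max_lag) at which x_delayed best
    matches x_ref shifted later in time."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Cross-correlation of the two channels at this lag.
        score = sum(
            x_ref[t - lag] * x_delayed[t] for t in range(lag, len(x_delayed))
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

With noise and reverberation the correlation peak becomes broader and may be
displaced, which is one reason why more elaborate estimators are used in
practice.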



3 Neural System for Sound Localization 

The proposed neural system for sound localization consists of L equally spaced
microphones (forming a linear array) connected to a recurrent neural network
whose architecture is shown in Fig. 1. The received discrete values of the
waveforms coming from all L microphones are normalized and fed into the L input
units of the network. At a given moment of time t, the received signals can be
described by the L-dimensional measurement vector
$u(t) = [\, u_1(t) \;\; u_2(t) \;\; \cdots \;\; u_L(t) \,]^T$. The input data
for the neural network at the moment t are composed of a train of N such
L-tuple vectors, which can be treated as the N-dimensional measurement matrix

$$ U(t) = [\, u(t) \;\; \cdots \;\; u(t-n) \;\; \cdots \;\; u(t-N+1) \,] . $$



3.1 Structure of the Recurrent Neural Network

The structure of the exploited neural network is presented in Fig. 1. The
network is an extended version of the general recurrent neural network proposed
by Elman [10], and is composed of:

- the Input Layer, consisting of N×L + 1 units including a bias,

- the Hidden Layer, consisting of M neurons,

- the Context Layer, consisting of M units whose outputs are delayed by a
single cycle with regard to those in the hidden layer,

- the Output Layer, consisting of K neurons.

The input vector (matrix) x(t) consists of the measurement matrix U(t), which
represents the signals received by the L microphones over the last N time
units, augmented by the fixed value -1.




Fig. 1. Architecture of the extended globally recurrent neural network 



3.2 Training of the Recurrent Neural Network 

In the general case, the training of a network such as that presented in
Fig. 1 is based on the relationships introduced by Williams and Zipser [25]. In
order to simplify the notation, the Input Layer formed as a rectangular array
(as in Fig. 1) can be “unfolded”, and hence can be referred to as a linear
array of $N_1 = N \times L + 1$ units. As a result, the input vector can be
described as follows:

$$ x(t) = [\, u_1(t) \;\cdots\; u_L(t) \;\cdots\; u_1(t-N+1) \;\cdots\; u_L(t-N+1) \;\; -1 \,]^T . \qquad (3) $$

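
The unfolding of the input layer can be sketched in Python as follows. The
sketch assumes that the normalized samples are stored as one list of L values
per time step; the function name and this storage layout are choices made for
the example only.

```python
def build_input_vector(samples, t, N):
    """Unfold the last N measurement vectors into one input vector.

    samples: samples[k] is the list of L normalized microphone values at
             time step k
    t:       current time step
    N:       number of time steps kept in the input window
    Returns [u(t), u(t-1), ..., u(t-N+1), -1] flattened, of length N*L + 1.
    """
    if t - N + 1 < 0:
        raise ValueError("need at least N time steps up to and including t")
    x = []
    for k in range(t, t - N, -1):   # t, t-1, ..., t-N+1
        x.extend(samples[k])
    x.append(-1.0)                  # fixed input acting as the bias
    return x
```

For L = 2 microphones and a window of N = 2 time steps the resulting vector has
2×2 + 1 = 5 entries, the last of which is the constant -1.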

Assuming that the error measure at the t-th moment of discrete time is the
mean-square error as below:

$E(t) = \frac{1}{2} \sum_{k=1}^{K} [d_k(t) - o_k(t)]^2$   (4)

where: $d_k(t)$ - the desired response of the k-th output neuron,

$o_k(t)$ - the current output of this neural unit,

it can be shown that the update expressions for the particular weights at a given
learning rate $\eta$ are computed in the following way:

$\Delta z_{km}(t) = \eta \cdot [d_k(t) - o_k(t)] \cdot f'_k(netZ_k(t)) \cdot y_m(t)$   (5)

$\Delta v_{mn}(t) = \eta \cdot \sum_{k=1}^{K} [d_k(t) - o_k(t)] \cdot f'_k(netZ_k(t)) \cdot \sum_{j=1}^{M} z_{kj}(t) \cdot R^{j}_{mn}(t)$   (6)

$\Delta w_{mn}(t) = \eta \cdot \sum_{k=1}^{K} [d_k(t) - o_k(t)] \cdot f'_k(netZ_k(t)) \cdot \sum_{j=1}^{M} z_{kj}(t) \cdot S^{j}_{mn}(t)$   (7)

where: $f'_k$ - the derivative of the activation function for the k-th neuron,

$y_m(t)$ - the output of the m-th hidden neural unit,

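and the auxiliary terms $R^{j}_{mn}(t)$ and $S^{j}_{mn}(t)$ are defined as follows:

$R^{j}_{mn}(t) = f'_j(netVW_j(t)) \cdot [\delta_{jm} \, y_n(t-1) + \sum_{i=1}^{M} v_{ji}(t) \cdot R^{i}_{mn}(t-1)]$   (8)

$S^{j}_{mn}(t) = f'_j(netVW_j(t)) \cdot [\delta_{jm} \, x_n(t) + \sum_{i=1}^{M} v_{ji}(t) \cdot S^{i}_{mn}(t-1)]$   (9)

where $\delta_{jm}$ denotes the Kronecker delta, and the weighted sums $netZ_k(t)$ and
$netVW_m(t)$ are calculated as below:

$netVW_m(t) = \sum_{j=1}^{M} v_{mj}(t) \cdot y_j(t-1) + \sum_{j=1}^{N_I} w_{mj}(t) \cdot x_j(t)$ ;   $netZ_k(t) = \sum_{j=1}^{M} z_{kj}(t) \cdot y_j(t)$   (10)
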


4 Experiments

The objectives of the planned experiments are the design and implementation of
various methods for sound source localization, their mutual comparison, and their
verification under real conditions. The long-term experiments are therefore divided
into a number of surveys, including recordings made:

- in acoustically adapted chambers (anechoic chamber-like conditions),

- in different real conference rooms,

- with many speakers,

- with various distributions of participants.

The objectives of the preliminary experiments are the design, implementation and
tuning of the relevant methods and the comparison of their efficiency with that of
the standard ones. Therefore some simplifications of real conditions have been
introduced.



4.1 Organization and Conditions of Experiments

The preliminary experiments were carried out in an acoustically adapted chamber,
i.e. all chairs and other unnecessary items had been removed and curtains had been
hung over the doors, windows and other reflecting planes. The microphone array
consisted of 4 electret microphones spaced 5 cm apart, and was fixed 1.58 m from the
floor and 1.25 m from the ceiling. Four mono tracks were recorded simultaneously
using the signals received from these microphones.

The recordings used 16 bits per sample at sampling frequencies of 8 kHz and 48 kHz.
There was one male speaker, at a distance of 1.5 m from the array. The speaker read
a logatom list from angles differing by 1°. As a result, 24 four-track recordings
were made, each lasting approx. 50 s.



4.2 Surveys and Results of Experiments

For the preliminary experiments, the logatom lists were reduced and their number was
limited to 7, which represented sound directions from -45° to +45° in steps of 15°
(7 classes). The structure of the RNN was as follows. The number of time units N
ranged from 3 to 5, which for L = 4 microphones yielded $N_I \in \{13, 17, 21\}$ input
units. The input vectors formed sequences of various lengths (2, 3, 4 and 5). The
number of hidden neurons was arbitrarily set to 10 or 15, whereas the output layer
consisted of 7 neurons. The activation function was the unipolar continuous function.

The training and testing vectors and sequences were different and were selected
randomly. The number of vectors per class was equal to 100, which yielded a total of
700 vectors for a training and a testing phase. In order to obtain statistically
valid results, the computations were repeated 10 times for each survey.

The accuracy of detecting the correct direction ranged from 73 % to 82 %. An
interesting effect was observed: for input vectors that were too long, the efficiency
decreased. This can be interpreted as the network beginning to approximate the
waveforms themselves rather than their mutual relationships, which are essential for
sound localization.



5 Conclusions 

In this paper, a new algorithm for estimating sound source location, based on
recurrent neural networks, has been proposed and described. Moreover, the scheme of
the experiments and some of their results have been included. The results of
automatic discrimination of sound source direction obtained with the implemented
algorithm are promising. More experiments are planned in order to compare the
obtained results with the scores achievable under the same acoustic conditions with
existing algorithms for automatic sound source tracking.



Abstract. This paper introduces a neural network architecture based on rough
sets and rough membership functions. The neurons of such networks instantiate 
approximate reasoning in assessing knowledge gleaned from input data. Each 
neuron constructs upper and lower approximations as an aid to classifying 
inputs. Rough neuron output has various forms. In this paper, rough neuron 
output results from the application of a rough membership function. A brief 
introduction to the basic concepts underlying rough membership neural 
networks is given. An application of rough neural computing is briefly 
considered in classifying the waveforms of power system faults. Experimental 
results with rough neural classification of waveforms are also given. 



1 Introduction 



A form of rough neural computing based on rough sets, rough membership
functions, and decision rules is introduced in this paper. Rough sets were introduced
by Pawlak [1], and elaborated in [2]-[3]. Rough membership functions were
introduced by Pawlak and Skowron [4]. Studies of neural networks in the context of
rough sets are extensive [5]-[12]. This paper considers the design and application of
neural networks with two types of rough neurons: approximation neurons and decider
neurons. The term rough neuron was introduced in 1996 [5]. In its original form, a
rough neuron was defined relative to upper and lower bounds, and inputs were
assessed relative to boundary values. More recent work considers rough neural
networks (rNNs) with neurons that construct rough sets and output the degree of
accuracy of an approximation [10]-[11], which is based on an earlier study [9]. The
study of rough neurons is part of a growing number of papers on neural networks
based on rough sets. Rough-fuzzy multilayer perceptrons (MLPs) in knowledge
encoding and classification were introduced in [12]. Rough-fuzzy neural networks
have recently also been used in classifying the waveforms of power system faults
[10]. Purely rough membership function neural networks (rmfNNs) were introduced
in [11] in the context of rough sets and the recent introduction of rough membership
functions [4]. This paper considers the design of rough neural networks based on
rough membership functions, and hence this form of network is called a rough
membership neural network (rmNN). Preliminary computations in a rmNN are
carried out with a layer of approximation neurons, which construct rough sets and
where the output of each approximation neuron is computed with a rough
membership function. The values produced by a layer of approximation neurons are
used to construct a condition vector. Each new condition vector provides a stimulus
for a decider neuron in the output layer of a rmNN. A decider neuron enforces rules
derived from decision tables based on rough set theory. A decision table reflects our
knowledge of the world at a given time. This knowledge is represented by condition
vectors and corresponding decisions. Information granules in the form of rules are
extracted from decision tables using rough set methods. Discovery of decider neuron
rules stems from an application of the rule derivation method given in [13]-[14]. 
This characterization of a decider neuron is based on the identification of information 
granules based on decision rules [15]. Each time a decider neuron is stimulated by a 
new condition vector constructed by the approximation neuron layer, it searches for 
the closest fit between each new condition vector and existing condition vectors 
extracted from a decision table. Decider neurons are akin to what are known as logic 
neurons described in [16]. 



2 Rough Membership Functions 

A brief introduction to the basic concepts underlying the construction of rough
membership neural networks is given in this section. A rough membership function
(rm function) makes it possible to measure the degree to which any specified object
with given attribute values belongs to a given set X [4], [21]. A rm function is
defined relative to a set of attributes $B \subseteq A$ in an information system
$S = (U, A)$ and a given set of objects X. The equivalence classes $[x]_B$ induce a
partition of the universe. Let $B \subseteq A$, and let X be a set of observations of
interest. The degree of overlap between X and the class $[x]_B$ containing x can be
quantified with the rough membership function $\mu_X^B : U \to [0, 1]$ defined by

$\mu_X^B(x) = \frac{|[x]_B \cap X|}{|[x]_B|}$.
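
As a small illustration of the definition above, the following sketch computes the rough membership value as the fraction of the equivalence class of x that lies inside X. The four-object information table, the attribute names and the set X are invented for the example and do not come from the paper.

```python
from fractions import Fraction

def equivalence_class(x, universe, table, B):
    """Objects that are indiscernible from x on every attribute in B."""
    return {y for y in universe if all(table[y][a] == table[x][a] for a in B)}

def rough_membership(x, X, universe, table, B):
    """|[x]_B intersected with X| divided by |[x]_B|, as an exact fraction."""
    cls = equivalence_class(x, universe, table, B)
    return Fraction(len(cls & X), len(cls))

# Hypothetical information system with two attributes.
table = {
    "o1": {"a1": 0, "a2": 1},
    "o2": {"a1": 0, "a2": 1},
    "o3": {"a1": 0, "a2": 0},
    "o4": {"a1": 1, "a2": 0},
}
universe = set(table)
X = {"o1", "o3"}  # set of interest

print(rough_membership("o1", X, universe, table, ["a1", "a2"]))  # 1/2
print(rough_membership("o1", X, universe, table, ["a1"]))        # 2/3
```

Dropping attribute a2 enlarges the equivalence class of o1 from {o1, o2} to {o1, o2, o3}, which is why the membership value changes with the choice of B.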



3 Design of Rough Neural Networks 

Neural networks are collections of massively parallel computation units called 
neurons. A neuron is a processing element in a neural network. 

3.1 Design of Rough Neurons 

Typically, a neuron y maps its weighted inputs from $R^n$ to [0, 1] [16]. Let T be a
decision table (X, A, {d}) used to construct $\underline{B}X$, $\overline{B}X$, and let
$X \subseteq Y$. A selection of different types of neurons is given in Table 1: common
neurons, approximation neurons and decider neurons.



Table 1. Selection of Different Types of Neurons

Common Neuron:
    $y = f(\sum_{i=1}^{n} w_i x_i + b)$, where input $x_i$ has connection (weight)
    $w_i$, which denotes a modifiable neural connection, and bias $b$ [16]

Upper Approximation Neuron:
    $y_u = f(\underline{B}X, \overline{B}X, X)$

Lower Approximation Neuron:
    $y = f(\underline{B}X, X)$

Decider Neuron:
    $y_{rule} = \min(e_i, d_i([\mu_X^B(y)]))$, with condition granule $[\mu_X^B(x)]$



Let B, F, $[f]_B$ denote a set of attributes, a set of neuron inputs (stimuli), and an
equivalence class containing measurements derived from known objects, respectively.
The basic computation steps performed by an approximation neuron are reflected in
the flow graph in Fig. 1.




Fig. 1. Flow Graph for Basic Approximation Neuron Computation 

An approximation neuron measures the degree of overlap of a set $[f]_B$ and
$\overline{B}F$ representing certain as well as uncertain classifications of input
signals. A flow graph showing the basic computations performed by a decider neuron
is given in Fig. 2.




Fig. 2. Flow Graph for Decider Neuron



A decider neuron implements a selectRule algorithm.

Algorithm selectRule {

    input set {α_i ⇒ β_i};                       // set of decision rules, i = 1, ..., r
    input vector [c_exp1, c_exp2, ..., c_expn];  // condition vector input { μ(f) }

    int chosenRule;      // index used to identify decision rule
    float[ ] sum;        // stores sum of differences | c_expj - c_ij |
    float bestMatch;     // used to store value of best match
    int i = 2;

    bestMatch = Σ_{j=1..n} | c_expj - c_1j |;  chosenRule = 1;   // for vector α_1
    while (i <= r) {
        sum[i] = Σ_{j=1..n} | c_expj - c_ij |;
        if (sum[i] < bestMatch) { bestMatch = sum[i]; chosenRule = i; }
        i++;             // use i to select condition vector
    } // while

    return chosenRule;

} // end Algorithm selectRule
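
A runnable sketch of the same selection step is given below, under the assumption that "closest fit" means the smallest sum of absolute differences between the experimental condition vector and a rule's condition vector. The two rules reuse the a1, a5, a7 values quoted from Rosetta later in this section; the experimental vector is hypothetical, and indices here start at 0 rather than 1.

```python
def select_rule(rules, c_exp):
    """Return the index of the rule whose condition vector is closest to c_exp.

    rules -- list of (condition_vector, decision) pairs
    c_exp -- experimental condition vector built from approximation neuron outputs
    """
    def distance(conditions):
        # Sum of absolute differences |c_exp_j - c_ij| over all components.
        return sum(abs(e - c) for e, c in zip(c_exp, conditions))

    # min() keeps the first rule on ties, like the strict '<' in the pseudocode.
    return min(range(len(rules)), key=lambda i: distance(rules[i][0]))

rules = [
    ((0.058824, 0.187500, 0.000000), 0),
    ((0.058824, 0.187500, 0.085714), 1),
]

chosen = select_rule(rules, (0.06, 0.19, 0.08))  # hypothetical measurement
decision = rules[chosen][1]
```

With this input the second rule is the closer one, so `chosen` is 1 and `decision` is 1.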

In Fig. 2, the set rmf = { μ(f) } consists of approximation neuron measurements in
response to the stimulus provided by a new object requiring classification. The
elements of the set rmf are used by a decider neuron to construct an experimental
condition vector $\alpha_{exp}$. A second input to a decider neuron is the set
R = {α ⇒ β}. The elements of the set R are rules which have been derived from a
decision table using rough set theory. After a decision rule has been selected, a
decider neuron outputs $\min(e_i, d_i)$, where $d_i \in \{0, 1\}$ and the relative error
$e_i = |c_{exp} - c_i| / c_i \in [0, 1]$. In cases where $d_i = 0$,
$y_{rule} = \min(e_i, d_i) = 0$, and the classification is unsuccessful. If $d_i = 1$,
then $y_{rule} = \min(e_i, d_i) = e_i$ indicates the relative error in a successful
classification.



3.2 Rough Neural Network Example 

By way of illustration, a rough neural network is constructed with two layers: an
input layer consisting of upper approximation neurons, and an output layer with a
single decider neuron (see Fig. 3). Using a sample of 61 fault files, a partial
decision table has been constructed (see Table 2). Let v, i denote voltage and
current, respectively. To complete the design of a decider neuron, rules are
extracted from decision Table 2.

Table 2. Sample Power System Commutation Fault Decision Table

Attributes: a1 = ac v error, a2 = phase i / i order, a3 = pole line v, a4 = 6 pulse,
a5 = phase i type, a6 = phase i ord, a7 = max phase i, d = decision.

a1       a2        a3    a4   a5       a6        a7        d
0.0588   0         0.5   0    0        0         0         0
0.0588   0.06977   0.5   1    0.1875   0.05405   0.08571   1
0.0588   0.06977   0.5   1    0.1875   0.05405   0.08571   1



A sample of the rules derived from Table 2 using Rosetta [22] is as follows.

a1(0.058824) AND a5(0.187500) AND a7(0.000000) -> d(0.00)
a1(0.058824) AND a5(0.187500) AND a7(0.085714) -> d(1.00)

Rules like those given above are incorporated in a decider neuron repository
(storage of rules associated with a decider neuron). In the experiments described in
this section, each approximation neuron is defined relative to a single attribute such
as AC disturbance (see a1 in Fig. 3). The decider neuron in Fig. 3 implements the
selectRule algorithm to produce its output. The design of a particular decider neuron
hinges on the derivation of rules from a decision table reflecting our current
knowledge of the world.




Fig. 3. Sample rough neural network 

In the constructed networks, the weights are not primitive; they are functions of
other parameters such as the set of features (attributes). The relationships between
the weight values and these other parameters are expressed in the paper by the rough
membership function. This function makes it possible to measure the degree to which
B-indiscernibility classes are included in a given set (in the considered example, in
the upper approximation of one of the decision classes). Hence, the process of tuning
weights in the network should be connected with the tuning of the parameters on which
these weights depend. In particular, in the considered example this can be related to
searching for a relevant feature set B of attributes.

3.3 Sample Verification

A comparison between the output from a rough neural network used to classify power 
system faults relative to 24 fault files and known classification of the sample fault 
data is given in Fig. 4. 

Fig. 4. Comparison of Rough Neural Network Output and Target Values

In all of the cases considered in Fig. 4, there is a close match between the target
faults and the faults identified by the neural network. Further, it should be observed
that a total of eleven neural networks were used (one for each type of fault file) to
generate the data used in Fig. 4 and to carry out a complete classification of all
fault files.




4 Concluding Remarks 

Two basic types of rough neurons have been identified: approximation neurons and
rule-based neurons. The output of an approximation neuron is a rm function value,
which indicates the degree of overlap between an approximation region and some
other set of interest in a classification effort. The output of a rule-based neuron is
a classification decision, which represents an assessment of the closeness of
experimental data to a known feature of a feature space. A sample application of
these neurons in a power system fault classification system has been given. We
consider the problem of learning schemes of information granule construction. These
schemes transform input granules into output ones. It is necessary to tune the
parameters in these schemes to obtain output granules of satisfactory quality from
input granules. One method of tuning these parameters can be based on finding
functions embedding these schemes into classical neural networks. Known
learning methods for neural networks can then be applied. The weight values in such
networks will reflect the inclusion degrees between granules. Hence the process of
changing weights in neural networks should correspond to tuning degrees of granule
inclusion. The paper presents an example of such a situation.



Abstract. Crossover is a main search operator of genetic algorithms
(GAs), which has distinguished GAs from many other algorithms. By
analyzing and imitating the implementation of the crossover operator, this paper
points out that crossover is intrinsically a heuristic mutation with reference.
Its reference object is just the other individual, which is mated with the
one that will be crossed over. On the basis of this conclusion, this paper
then explains and discusses the results obtained experimentally by other GA
researchers.



1 Introduction 

The Genetic Algorithm (GA) was invented in the 1960s by John Holland, a professor
at the University of Michigan. Holland's original goal was not to design algorithms
to solve specific problems, but rather to formally study the phenomenon of adaptation
as it occurs in nature and to develop ways in which the mechanisms of natural
adaptation might be imported into computer systems^^^^. However, after further
development by his students, colleagues and other researchers, the Genetic Algorithm
has been widely used on various problems as a robust search method, especially in
optimum seeking^^’^^.

As an important branch of Evolutionary Computation (EC), Genetic Algorithm 
(GA) is characterized by its current effectiveness, strong robustness, and simple 
implementation. It also has the advantage of not being restrained by certain 
restrictive factors of search space. Due to the advantages mentioned above, 
researchers have been showing increasing interest in GA. It has been applied 
successfully to many fields such as machine learning, engineering optimization, 
economy forecast, automatic programming, and so forth. Nowadays GA has 
become a very popular research subject in many branches of science^^"^^^. 

The crossover operator is the main search operator of the Genetic Algorithm and one
of the most important features distinguishing it from other search algorithms. The
effect of the crossover operator in the Genetic Algorithm has long been disputed in
the GA field. The Standard Genetic Algorithm and most improved
Genetic Algorithms adopt crossover as their main genetic operator^^'^^.
Evolutionary Strategy used only the mutation operator in its early development, and
mutation is still its most important operator^^^, although it later incorporated
crossover. Evolutionary Programming employs only the mutation operator
and does not employ crossover at all.^^^^

D. Fogel, one of the Evolutionary Programming advocates, declared that
crossover is not superior to mutation^^^^. Scheffer drew the conclusion from his
experiments that it is not always sufficient with only mutation. Based
on his theoretical analysis and experimental tests of the disruption and
construction properties of mutation and crossover, Spears, an Evolutionary
Computation researcher in the Artificial Intelligence Center of the American Navy
Laboratory, pointed out that mutation is more disruptive than crossover and crossover
is more constructive than mutation. One complements the other and both of them are
indispensable^^^^.

So far, the study about crossover and mutation is only based on the macro effect 
of the two operators. This paper will simulate and analyze the performance 
process of crossover operator to discover the microcosmic nature of crossover 
operator. 



2 The Nature of Crossover Operator 

The crossover operator works by exchanging genes between two selected parents,
by point crossover or uniform crossover, to create two new
individuals that each inherit some genes of their parents. The following
illustrates two-point crossover and uniform crossover with binary coding in order to
simulate and analyze the operation of the crossover operator.

Suppose the two selected parents are:

$\xi = \xi_1 \xi_2 \cdots \xi_L$,   $\xi_i \in \{0, 1\}$,   i = 1(1)L

$\eta = \eta_1 \eta_2 \cdots \eta_L$,   $\eta_i \in \{0, 1\}$,   i = 1(1)L

where L is the length of the chromosome bit string. From $\xi$ and $\eta$, we
construct a new individual $\mu$:

$\mu_i = \xi_i$ XOR $\eta_i$,   i = 1(1)L

Obviously the following equation holds:

$\sum_{i=1}^{L} \mu_i = H(\xi, \eta)$

where $H(\xi, \eta)$ represents the Hamming distance between $\xi$ and $\eta$;
$\mu$ is therefore called the Hamming individual of $\xi$ and $\eta$.

2.1 Two-Point Crossover 

With regard to two-point crossover, suppose the two selected crossover points are
$c_1$ and $c_2$ ($c_1 < c_2$), and the new individuals generated by the crossover
operator are respectively:

$\xi_t = \xi_1 \cdots \xi_{c_1-1} \, \eta_{c_1} \cdots \eta_{c_2} \, \xi_{c_2+1} \cdots \xi_L$

$\eta_t = \eta_1 \cdots \eta_{c_1-1} \, \xi_{c_1} \cdots \xi_{c_2} \, \eta_{c_2+1} \cdots \eta_L$

Now, using the Hamming individual $\mu$ as reference, we mutate $\xi$ as follows:

for i = 1 to c_1 - 1
    ξ'_t(i) = ξ_i
for i = c_1 to c_2
    if (μ_i == 0)
        then ξ'_t(i) = ξ_i
        else ξ'_t(i) = ~ξ_i
for i = c_2 + 1 to L
    ξ'_t(i) = ξ_i

The nature of the above mutation is:

The gene values between 1 and $c_1 - 1$ and between $c_2 + 1$ and L remain the same.
Between $c_1$ and $c_2$, if the gene value of $\mu$ at a locus is 0, the gene at that
locus is kept; otherwise it is inverted. That is, if the two parents have the same
gene value at a locus, that value is kept; otherwise it is inverted. In other
words, the transformation only exchanges the gene values of $\xi$ and $\eta$ between
$c_1$ and $c_2$. Therefore, the individual $\xi'_t$ created by the above mutation is
the same as $\xi_t$. Similarly, $\eta'_t = \eta_t$.
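
The equivalence argued above can be checked numerically. The sketch below is an illustration rather than code from the paper: the function names are made up, parents are lists of 0/1 values, and loci are 1-based and inclusive as in the text.

```python
import random

def hamming_individual(xi, eta):
    """mu_i = xi_i XOR eta_i."""
    return [a ^ b for a, b in zip(xi, eta)]

def two_point_crossover(xi, eta, c1, c2):
    """Exchange loci c1..c2 (1-based, inclusive) between the two parents."""
    lo, hi = c1 - 1, c2
    return (xi[:lo] + eta[lo:hi] + xi[hi:],
            eta[:lo] + xi[lo:hi] + eta[hi:])

def reference_mutation(parent, mu, c1, c2):
    """Invert the parent's gene at every locus in c1..c2 where mu is 1."""
    return [g ^ 1 if c1 <= i <= c2 and mu[i - 1] else g
            for i, g in enumerate(parent, start=1)]

def check_two_point(trials=1000, L=12, seed=1):
    """Compare crossover and reference mutation on random parents and cut points."""
    rng = random.Random(seed)
    for _ in range(trials):
        xi = [rng.randint(0, 1) for _ in range(L)]
        eta = [rng.randint(0, 1) for _ in range(L)]
        c1 = rng.randint(1, L - 1)
        c2 = rng.randint(c1 + 1, L)
        mu = hamming_individual(xi, eta)
        xi_t, eta_t = two_point_crossover(xi, eta, c1, c2)
        if reference_mutation(xi, mu, c1, c2) != xi_t:
            return False
        if reference_mutation(eta, mu, c1, c2) != eta_t:
            return False
    return True

print(check_two_point())  # True
```

The check passes because, inside the exchanged segment, the gene taken from the other parent equals the original gene XOR the corresponding bit of the Hamming individual.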

2.2 Uniform Crossover 

With uniform crossover, suppose the created crossover template is
$v = v_1 v_2 \cdots v_L$; the two new individuals generated by uniform crossover are:

$\xi_u = \xi^u_1 \xi^u_2 \cdots \xi^u_L$,   $\xi^u_i = \xi_i$ if $v_i = 0$,   $\xi^u_i = \eta_i$ if $v_i = 1$,   i = 1(1)L

$\eta_u = \eta^u_1 \eta^u_2 \cdots \eta^u_L$,   $\eta^u_i = \eta_i$ if $v_i = 0$,   $\eta^u_i = \xi_i$ if $v_i = 1$,   i = 1(1)L

Now, using the Hamming individual $\mu$ and the uniform crossover template v as
reference, we mutate $\xi$ as follows:

for i = 1 to L
    if (v_i == 0)
        then ξ'_u(i) = ξ_i
    else if (μ_i == 0)
        then ξ'_u(i) = ξ_i
        else ξ'_u(i) = ~ξ_i



The nature of the above mutation is:

If the gene value of the created crossover template at a locus is 0,
then the gene of the individual at the same locus remains the same as the original
one. If the gene value of the crossover template at the locus is 1, then the gene is
inverted according to the Hamming individual: if $\mu_i = 0$, the gene value is kept;
if $\mu_i = 1$, it is inverted. So the mutation has the same effect as the crossover
operator. Thus, $\xi'_u = \xi_u$. Similarly, we can obtain $\eta'_u = \eta_u$.
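
The same kind of check can be made exhaustively for uniform crossover on short strings. As before, this is only an illustrative sketch with hypothetical function names; the mutation inverts a gene exactly where both the template bit and the Hamming-individual bit are 1.

```python
from itertools import product

def uniform_crossover(xi, eta, v):
    """Take the gene from the other parent wherever the template bit is 1."""
    xi_u = [e if t else x for x, e, t in zip(xi, eta, v)]
    eta_u = [x if t else e for x, e, t in zip(xi, eta, v)]
    return xi_u, eta_u

def masked_reference_mutation(parent, mu, v):
    """Invert the gene at loci where v_i = 1 and mu_i = 1; keep it elsewhere."""
    return [g ^ (t & m) for g, m, t in zip(parent, mu, v)]

def check_uniform(L=4):
    """Try every pair of parents and every template of length L."""
    bits = list(product([0, 1], repeat=L))
    for xi, eta, v in product(bits, repeat=3):
        mu = [a ^ b for a, b in zip(xi, eta)]   # Hamming individual
        xi_u, eta_u = uniform_crossover(xi, eta, v)
        if masked_reference_mutation(xi, mu, v) != xi_u:
            return False
        if masked_reference_mutation(eta, mu, v) != eta_u:
            return False
    return True

print(check_uniform())  # True
```

Two-point crossover is the special case in which the template is 1 exactly on the loci between the two crossover points.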

2.3 Analysis 

With the Hamming individual $\mu$ as reference object, the two mutation processes
respectively achieve the two-point and uniform crossover of $\xi$ and $\eta$ and give
the same result as the normal crossover operator. In fact, any crossover operation
defined on any alphabet can be carried out through a corresponding mutation guided
by the intermediate individual created from the two parents. Therefore, we arrive at
the following theorem: Crossover is intrinsically a heuristic mutation with reference,
and the reference object is just the other individual, which is mated with the one
that will be crossed over.



3 Discussion 

The above conclusion may be used to explain the function and effect of crossover 
operator in evolutionary progress and help design more general operators with 
mutation and crossover effect at the same time. 

• Because the crossover operator does not alter the genes that are the same in the
two parents, and only changes the genes that differ, the "mutation" is confined to the
differing genes of the two individuals and so it may produce a certain heuristic
effect. The inferior individual receives part of the genes of the superior individual,
and is therefore more likely to improve its fitness. If the exchanged gene of the
superior individual is harmful to the former individual because of gene epistasis, the
fitness of the superior individual is ameliorated to some extent. This explains the
exploitation effect of the crossover operator.

• The Hamming distance between an offspring created by two individuals
with Hamming distance d and its parents is between 0 and d. Depending on how
similar the two parents are and on the crossover points, crossover
may generate a stronger mutative effect than normal mutation (0 ≤ number of mutated
genes ≤ d). This is why the global search of the crossover operator is superior.

• Because the nature of the crossover operator is heuristic mutation with
reference, theoretically mutation can implement everything that crossover can do.
That is, mutation alone should be sufficient. However, because a single aimless
mutation lacks the referential heuristics of the crossover operator, "it is not always
sufficient with only mutation".

• Two parents with Hamming distance d create at most max{2d - 2, 0} new individuals
by single-point crossover, at most $d^2 - d$ new individuals by two-point crossover,
and at most max{$2^d$ - 2, 0} new individuals by uniform crossover (the most
individuals that crossover can generate). This is the reason why two-point crossover
is more constructive than single-point crossover and less constructive than uniform
crossover.

• Because the crossover operator cannot alter the genes that are the same in the two
parents, the capacity of the space spanned by two individuals with Hamming distance d
is $2^d$, while the genetic space capacity of the whole problem is $2^L$. Therefore,
when population diversity is lacking, the performance of a Genetic Algorithm that
adopts only the crossover operator, or whose main search operator is crossover, is
greatly affected.

• Because the crossover operator only mutates the genes that differ between the two
parents, individuals that are very similar should avoid crossover. Since low-fitness
individuals usually lack referential heuristics, two low-fitness individuals
should avoid crossover too.



4 Conclusion

This paper simulates and analyzes the operation of the crossover operator
and draws the conclusion that crossover is in nature a heuristic
mutation with reference. Unlike the normal mutation operator, it mutates only
the genes that differ between the two parents, and so it can achieve a referential
heuristic effect that normal mutation cannot. Genetic Algorithms, and even
Evolutionary Computation in general, in essence just transform and select a group of
feasible individuals. The more heuristic the transformation is, the more efficient the
search is. It is not necessary to restrict Genetic Algorithms or Evolutionary
Computation to the simulation of principles or processes of evolution and
biological genetics. Research should focus on how to improve the heuristics of
feasible transformations and the rationality of selection, although this does not
exclude drawing on results from related sciences.



Abstract: In this paper, we introduce an Integral Neural Network based
on the complex domain. We describe the model of the neuron, analyze the
behavior of the neuron, and show that under certain conditions it
performs the calculation of the Fourier integral. Having studied the
neural network with hidden layers, we obtain the following facts: 1.
This kind of structure can memorize a time-variant function; 2. It
calculates the convolution of the input series and the function the neural
network has memorized; 3. This neural network structure can also calculate
the correlation function; 4. In the case of many hidden layers, it can
perform the Fourier Transform with many variables.



1. Introduction 

As early as 1975, researchers had already extended the conventional LMS
algorithm to the complex domain [5]. Even so, few papers have been published on
complex domain neural networks. The majority of those papers treat a
complex number as a pair of real numbers [9]. Studies of complex domain NNs
have mainly focused on the complex BP Neural Network and complex associative memory.

In the field of the complex BP Neural Network, T. Nitta has already done much
work [2]. His main results include the following: It is not suitable to choose a
holomorphic function as the activation function. The learning speed of the complex
BP algorithm is faster than that of real BP. The complex BP network can learn the
basic geometric transformations (rotation, similarity transformation, parallel
displacement, etc.). Several works extend complex domain neural
networks to associative memory [1][3][4]. The ways of studying complex associative
memory are similar to those of studying real associative memory, except that the data
range is extended to the complex domain [6].

There are also some reports on Neural Networks for Time Series Processing.
The basic idea of introducing time-variant functions into neural networks is based on
the fact that the world is always changing. Whatever we observe or measure has
different values at different points of time [10]. The NN mechanism for handling
time-variant sequences is much like that of finite automata. It has internal states
and feedback structures [11].

In this paper we analyze the mathematical function of the complex integral
neural network [7][8]. We introduce the neuron model and the neuron behavior in
Section 2 and Section 3. In Section 4, we study the function of the neural network
with one hidden layer. In Section 5, we study the cases which have many hidden layers.
Finally, we conclude the paper.



2. The Integral Neuron Model Based on Complex Domain 

Figure 1 shows the structure of a complex integral neuron. It is based on the
complex domain, so its weights, input and output signals are all taken from the
complex domain. Besides, the model has two other features. One is that the inputs and
outputs of the neuron are all time series rather than fixed numbers; the other is that
the output signal is fed back to the neuron itself. The detailed process is as follows:

When t = 0, the input signals are $f_{1i}(0), f_{2i}(0), \ldots, f_{mi}(0)$. They are
multiplied by the weights $Z_1, Z_2, \ldots, Z_m$ respectively and summed,
$\sum_j (f_{ji}(0) \cdot Z_j)$, which gives the output $f_o(0)$. The relationship
between $f_o(0)$ and $\sum_j (f_{ji}(0) \cdot Z_j)$ depends on the activation function
F(·). At the same time, the output $f_o(0)$ is multiplied by the feedback weight
$Z_{m+1}$ and fed back to the neuron as a new input signal at t = 1. So, when t = 1,
the input signals are $f_{1i}(1), f_{2i}(1), \ldots, f_{mi}(1), f_o(0)$. The summation
is $\sum_j (f_{ji}(1) \cdot Z_j) + f_o(0) \cdot Z_{m+1}$ instead of
$\sum_j (f_{ji}(1) \cdot Z_j)$. After applying the activation function F(·), we get the
output $f_o(1)$, and so on. Generally, the output at t = n depends on the inputs at
t = n ($f_{1i}(n), f_{2i}(n), \ldots, f_{mi}(n)$) and the output at t = n-1
($f_o(n-1)$).




Though different activation functions can be chosen, in the rest of the paper we
always choose the identity function.



3. The function of the neuron 

To simplify the neuron model, we first consider a neuron which has only
one input (see Figure 2).

In Figure 2, the input series of the neuron is $f_i(t)$. The output series is
$f_o(t)$. The input weight and feedback weight are $z$ and $z'$ respectively. Suppose
$t_{n-1}$, $t_n$ are two adjacent time points; we get:

$f_o(t_n) = f_i(t_n) \cdot z + f_o(t_{n-1}) \cdot z'$   (1)

where $z$ and $z'$ are complex numbers, and $f_i(t)$, $f_o(t)$ are complex series. The
following theorem shows the function of this model.




Figure 2. The neuron with only one input.

Theorem 1. Suppose $z' = \exp(-i\omega)$ and $z = 1$. Then the output $f_o(t_n)$ is the
value of the Discrete Fourier Transform of the series
$f_i(t_0), f_i(t_1), \ldots, f_i(t_n)$ at the frequency $\omega$, where $n$ is the number of
sampling points.

Proof: Let $f_i(t) = f_n$ when $t = n$. We first prove

$$f_o(n) = \sum_{m=0}^{n} f_{n-m} \cdot (z')^m \qquad (2)$$

This is obviously true when $n = 0$ and $n = 1$.

Suppose it is true when $n = k-1$, that is

$$f_o(k-1) = \sum_{m=0}^{k-1} f_{k-1-m} \cdot (z')^m$$

Then

$$\begin{aligned}
f_o(k) &= f_k + f_o(k-1) \cdot z' \\
       &= f_k + \sum_{m=0}^{k-1} f_{k-1-m} \cdot (z')^{m+1} \\
       &= f_{k-0} \cdot (z')^0 + \sum_{m=1}^{k} f_{k-m} \cdot (z')^m \qquad \text{(replace } m+1 \text{ by } m\text{)} \\
       &= \sum_{m=0}^{k} f_{k-m} \cdot (z')^m
\end{aligned}$$

So formula (2) holds. This means that the output is the partial sum of the
unilateral Z transform of the reversed series $\{f_n\}$. If the time is long enough,
the partial sum is very close to the value of the Z transform of the input series at
the point $z = (z')^{-1}$. In the case of $z' = \exp(-i\omega)$, the partial sum

$$\sum_{m=0}^{n} f_{n-m} \cdot (z')^m$$

equals

$$\sum_{m=0}^{n} f_{n-m} \cdot e^{-i\omega m}$$

That is the partial sum of the Discrete Fourier Transform of the input series. If the
time is long enough, the partial sum can be very close to the value of the Fourier
Transform of the input series at the frequency point fixed by $\mathrm{Arg}(z')$.

End of proof.
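Formula (2) can be checked numerically. The sketch below is an illustration only, with hypothetical function names: it runs recurrence (1) with z = 1 and z' = exp(-iω) and compares the final output with the sum in formula (2).

```python
import cmath


def neuron_output(samples, omega):
    """Run f_o(n) = f_i(n) * z + f_o(n-1) * z' with z = 1, z' = exp(-i*omega)."""
    feedback = cmath.exp(-1j * omega)
    output = 0j  # assumption: the feedback term is zero before the first sample
    for sample in samples:
        output = sample + output * feedback
    return output


def partial_sum(samples, omega):
    """Evaluate formula (2): the sum over m of f_{n-m} * exp(-i*omega*m)."""
    n = len(samples) - 1
    return sum(samples[n - m] * cmath.exp(-1j * omega * m) for m in range(n + 1))
```

Because the sum runs over the reversed series, it equals exp(-iωn) times the sum of f_j exp(iωj). For real-valued samples its magnitude therefore matches the magnitude of the usual DFT sum at ω.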

The experimental results [8] show that this kind of neuron can be used as a filter
or an amplitude indicator. The following theorem is also interesting.

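Theorem 2. Suppose $f(t) \in L^1$, and for all $n \in N$, $f_i(n) = f(n)$ and
$F(\omega) = \int_0^\infty f(t)\, e^{-i\omega t}\, dt$. If the sampling frequency is larger than
$\omega_0$, and $z' = \exp(-i\omega_0)$, then

$$F(\omega_0) = \lim_{n \to \infty} \frac{f_o(n)}{n\omega_0}$$

Proof: Because $f(t) \in L^1$, the integral
$F(\omega) = \int_0^\infty f(t)\, e^{-i\omega t}\, dt$ is convergent.
The DFT form is:

$$F(\omega_0) = \frac{1}{2\pi} \sum_{m=0}^{T_0} f(m) \cdot e^{-i\omega_0 m}$$

By Theorem 1 we get:

$$\begin{aligned}
f_o(n) &= \sum_{m=0}^{n} f(m) \cdot e^{-i\omega_0 m} \\
       &= \sum_{m=0}^{T_0} f(m) \cdot e^{-i\omega_0 m} + \ldots
          + \sum_{m=(k-1)T_0+1}^{kT_0} f(m) \cdot e^{-i\omega_0 m}
          + \sum_{m=kT_0+1}^{n} f(m) \cdot e^{-i\omega_0 m} \\
       &= 2\pi \cdot k \cdot F(\omega_0) + \sum_{m=kT_0+1}^{n} f(m) \cdot e^{-i\omega_0 m}
\end{aligned}$$

Let $n_0 = n - kT_0$, with $n_0 < T_0$.

Because $\omega_0 = 2\pi / T_0$ and $n = kT_0 + n_0$, we get $k \approx n / T_0 = n\omega_0 / 2\pi$.

When $n \to \infty$, $k = n\omega_0 / 2\pi$, so

$$\begin{aligned}
\lim_{n \to \infty} \frac{f_o(n)}{n\omega_0}
  &= \lim_{n \to \infty} \frac{2\pi \cdot k \cdot F(\omega_0) + \sum_{m=kT_0+1}^{n} f(m) \cdot e^{-i\omega_0 m}}{n\omega_0} \\
  &= \lim_{n \to \infty} \left\{ \frac{n\omega_0 \cdot F(\omega_0)}{n\omega_0}
     + \frac{\sum_{m=kT_0+1}^{n} f(m) \cdot e^{-i\omega_0 m}}{n\omega_0} \right\} \\
  &= \lim_{n \to \infty} \frac{n\omega_0 \cdot F(\omega_0)}{n\omega_0}
     \qquad \left( \lim_{n \to \infty} \frac{\sum_{m=kT_0+1}^{n} f(m) \cdot e^{-i\omega_0 m}}{n\omega_0} = 0 \right) \\
  &= F(\omega_0)
\end{aligned}$$

End of proof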

Let us consider the general situation of a multi-input neuron (see Figure 1). The
output series should be:

$$f_o(t) = \int \Big( \sum_{k=1}^{m} f_{ki}(t)\, z_k \Big) \cdot z_{m+1}^{t}\, dt \qquad (3)$$

From formula (3) we can see that the weights of the neuron inputs play a different
role from that of the self-feedback weight. The former give each input signal an
adjustment of phase and/or amplitude, while the latter determines the impulse
response of the adjusted signals at a certain frequency point. Because adjusting the
phase and/or amplitude does not change the frequency, we can amplify or diminish
the original signal and prevent or facilitate the cancellation of two or more
signals.



4. Network with one hidden layer 



Function Storing 

To understand what information is stored by the weight values, we let the input
series be the $\delta$ function. The $\delta$ function is defined as follows:

$$\delta(n) = \begin{cases} 1 & n = 0 \\ 0 & n \neq 0 \end{cases}$$

If we use it as the input signal of the neuron in Figure 2, we can easily see that
the output series is $z \cdot \exp(-in\omega)$, where $z' = \exp(-i\omega)$. That means the
weights memorize the function $z \cdot \exp(-in\omega)$. If we combine a set of neurons as in
Figure 3, the function they memorize is $\sum_k z_k \cdot \exp(-in\omega_k)$. Because any
function that satisfies the Dirichlet conditions can be represented as a Fourier
series, a set of hidden layer neurons can store a large set of functions.
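The impulse response of the single-input neuron can be checked directly. The sketch below is illustrative only (the name `impulse_response` is hypothetical); it feeds a unit impulse through recurrence (1) and returns the resulting output series.

```python
import cmath


def impulse_response(z, omega, length):
    """Feed a unit impulse through the single-input neuron of Figure 2.

    z is the input weight and exp(-i*omega) is the feedback weight z'.
    """
    feedback = cmath.exp(-1j * omega)
    output = 0j  # assumption: no feedback before the first sample
    outputs = []
    for n in range(length):
        delta = 1.0 if n == 0 else 0.0  # discrete unit impulse
        output = delta * z + output * feedback
        outputs.append(output)
    return outputs
```

Each element `impulse_response(z, omega, length)[n]` should agree with `z * cmath.exp(-1j * n * omega)` up to rounding error.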

An interesting fact is that the same set of neurons can memorize different
functions in different channels or at different outputs.




Figure 3. (a) A set of neurons storing different functions in different channels. (b) A set of neurons
producing different functions at different outputs.

In Figure 3(a), the set of neurons memorizes different functions in different
channels. That is, if we put the $\delta$ function at the $f_{1i}$ input end, the output series
is $\sum_k z_{1k} \cdot \exp(-in\omega_k)$; if we put the $\delta$ function at the $f_{2i}$ input end, the
output series is $\sum_k z_{2k} \cdot \exp(-in\omega_k)$. In Figure 3(b), the set of neurons
produces different functions at different outputs. That is, if we input the $\delta$
function, the output series $f_{o1}(t)$ and $f_{o2}(t)$ are
$\sum_k z_{1k} \cdot \exp(-in\omega_k)$ and $\sum_k z_{2k} \cdot \exp(-in\omega_k)$ respectively.

Convolution 

The above study shows that the output is the memorized function f(t) if the
input function is the $\delta$ function. In many cases, the input function is an
arbitrary time series function g(t). What will the output be? The following theorem
tells us which calculation the neurons perform on the functions f(t) and g(t).

Theorem 3. Suppose f(t) is a function that is memorized by a set of neurons with
the structure of Figure 3, and g(t) is an input function. The output function is the
convolution of f(t) and g(t): $f_o(s) = \sum_n f(n)\, g(s-n)$.

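Proof: We first consider the neuron described in Figure 2. The function it stores is:

$$g(n) = z \cdot \exp(-in\omega)$$

We will prove

$$f_o(s) = \sum_n f(n)\, g(s-n) = \sum_n f(n) \cdot \big( z \cdot \exp(-i(s-n)\omega) \big) \qquad (4)$$

When $s = 0$,

$$f_o(0) = f(0) \cdot z = f(0)\, g(0) \qquad \text{(formula (4) is true)}$$

and when $s = 1$,

$$f_o(1) = f(1) \cdot z + f(0) \cdot (z \cdot z') = f(0)\, g(1) + f(1)\, g(0) \qquad \text{(formula (4) is true)}$$

Suppose $f_o(k-1) = \sum_n f(n)\, g(k-1-n)$ holds. Then

$$\begin{aligned}
f_o(k) &= f(k) \cdot z + f_o(k-1) \cdot z' \\
       &= f(k)\, g(0) + \Big( \sum_n f(n)\, g(k-1-n) \Big) \cdot z' \\
       &= f(k)\, g(0) + \sum_n f(n)\, g(k-1-n) \cdot z' \\
       &= \sum_n f(n)\, g(k-n) \qquad \text{(because } g(k-1-n) \cdot z' = g(k-n)\text{)}
\end{aligned}$$

So formula (4) is true for all $s \in N$. For the NN described in Figure 3,
$g(n) = \sum_k z_k \cdot \exp(-in\omega_k) = \sum_k g_k(n)$.

The output is:

$$f_o(s) = \sum_k \Big\{ \sum_n f(n)\, g_k(s-n) \Big\}
         = \sum_n \Big\{ f(n) \Big( \sum_k g_k(s-n) \Big) \Big\}
         = \sum_n f(n)\, g(s-n)$$

End of proof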
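The statement of Theorem 3 can be illustrated numerically. The sketch below assumes that the structure of Figure 3 is a set of parallel single-input neurons driven by the same signal whose outputs are summed; that reading, and all function names, are assumptions made for illustration.

```python
import cmath


def stored_function(weights, omegas, n):
    """Memorized function: the sum over k of z_k * exp(-i*n*omega_k)."""
    return sum(z * cmath.exp(-1j * n * w) for z, w in zip(weights, omegas))


def network_output(signal, weights, omegas):
    """Sum the outputs of parallel neurons, each running recurrence (1)."""
    feedbacks = [cmath.exp(-1j * w) for w in omegas]
    states = [0j] * len(weights)  # assumption: every neuron starts at zero
    outputs = []
    for sample in signal:
        states = [sample * z + s * fb
                  for z, s, fb in zip(weights, states, feedbacks)]
        outputs.append(sum(states))
    return outputs


def direct_convolution(signal, weights, omegas):
    """Evaluate the convolution sum of formula (4) term by term."""
    return [
        sum(signal[n] * stored_function(weights, omegas, s - n)
            for n in range(s + 1))
        for s in range(len(signal))
    ]
```

The two functions `network_output` and `direct_convolution` should return the same series up to rounding error for any input signal.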

Theorem 3 shows that the integral NN with one hidden layer of neurons can
memorize a time-variant function f(t), which, together with the input series g(t),
produces the convolution series of the functions f(t) and g(t). Applying the
time-shifting property and the conjugate property [7], this structure can also
calculate the values of the correlation function series. This is because

$$\mathrm{Cor}(f, g)(\tau) = \int f(t)\, g(t+\tau)\, dt = \int f(t)\, g^*(-t-\tau)\, dt = (f * g^*)(-\tau),$$

and we can get $g^*(t)$ from $g(t)$ by replacing $z'_1, z'_2, \ldots, z'_n$ with
$(z'_1)^{-1}, (z'_2)^{-1}, \ldots, (z'_n)^{-1}$.

5. The behavior of the Integral NN with many hidden layers 

In Section 3, we showed that a single neuron can perform the Fourier Transform.
Here we illustrate that an NN with many hidden layers can perform the Fourier
Transform in several variables. We give only the example of two hidden layers, as
drawn in Figure 4.



Figure 4. The NN with two hidden layers (first middle layer and second middle layer)

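Theorem 4. Suppose the feedback weights of the first hidden layer neurons and the
second hidden layer neurons are $\exp(-i\omega_l)$ and $\exp(-i\psi_k)$ respectively, where
$l = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, n$. The output of the $k$-th neuron of the second
layer will be $F(\omega_l, \psi_k)$ if $w_{lk} = 1$ and $w_{jk} = 0$ for all $j \neq l$, where
$F(\omega, \psi)$ is the two-variable DFT of the input series $f(x, y)$,

$$F(\omega, \psi) = \sum_{t_2=0}^{t_{20}} \Big( \sum_{t_1=0}^{t_{10}} f(t_1, t_2)\, e^{-i\omega t_1} \Big) e^{-i\psi t_2},$$

and the input series of the first hidden layer are $f_l(x) = f(x, l)$, $l = 1, 2, \ldots, m$.

Proof: By Theorem 1, the output series from the $l$-th neuron of the first layer is:

$$y_l(t_{10}) = \sum_{t_1=0}^{t_{10}} f_l(t_{10} - t_1) \cdot (e^{-i\omega_l})^{t_1}$$

So the output of the $k$-th neuron of the second layer is:

$$\begin{aligned}
& \sum_{t_2=0}^{t_{20}} \Big( \sum_{j=1}^{m} w_{jk} \sum_{t_1=0}^{t_{10}} f_j(t_{10} - t_1) \cdot (e^{-i\omega_j})^{t_1} \Big) (e^{-i\psi_k})^{t_2} \\
&= \sum_{t_2=0}^{t_{20}} \sum_{t_1=0}^{t_{10}} f_l(t_{10} - t_1) \cdot (e^{-i\omega_l})^{t_1} (e^{-i\psi_k})^{t_2} \qquad \text{(only } w_{lk} \neq 0\text{)} \\
&= F(\omega_l, \psi_k)
\end{aligned}$$

End of proof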



6. Conclusion 

In this paper, we introduce an Integral Neural Network based on the complex
domain, whose inputs and outputs are all time series. We describe the model of this
kind of neuron, analyze the behavior of a single neuron, and show that under
certain conditions it computes the Fourier integral.

In the case of the NN with one hidden layer, the weights memorize a time series
function. If the input function is the $\delta$ function, the memorized function f(t) is
retrieved at the output. If the input function is g(t), the output is the
convolution series of the stored function f(t) and the input function g(t).

In the case of the NN with many hidden layers, the outputs are the multivariate
DFT under certain conditions.

With the above features, this kind of neural network can serve as a filter, as an
amplitude indicator, and as a unit for computing convolution and correlation
series.

Abstract. The trace assertion method is a formal state machine based
method for specifying module interfaces [1,9]. It can be seen as an
alternative to the algebraic specification technique. We extend the sequential
model presented in [9] by allowing simple concurrency.



1 Introduction 

It is a well established fact that state machine (not necessarily finite)
models and algebraic models are equivalent ([2,4]). This relationship differs
for different machines and algebras, but the general idea of the relationship
may be illustrated as follows:

    $\delta(p, a) = q$          $a(p) = q$
    state machine          algebra

where $\delta$ is a transition function of a state machine with $a$ as a function
name, and $a(p)$ is a function named $a$ applied to $p$.

Very often automata models are better suited for specifying and analyzing
concrete software systems, while algebraic models are better suited
for defining more abstract and general theories. This is exactly the case for
algebraic specification versus the trace assertion method (see [9]). The trace
assertion method was first formulated by Bartussek and Parnas in [1], as
a possible answer to some problems with algebraic specifications [3, 15],
like specifying a bounded stack (bounded modules in general). It can also
avoid the problem of overspecification in model-oriented specifications,
e.g. [9]. Since its introduction the method has undergone many modifications
[5, 11, 9]. In recent years, there has been an increased interest in the
trace assertion method [9, 14]. Despite many important industry applications,
solid mathematical foundations of the trace assertion method have
only very recently been provided, see [9], our major reference. The model
presented in [9] does not include concurrency. In this paper we add some
concurrency to the trace assertion method.

The trace assertion method is based on the following postulates:

(1) Information hiding [12, 13] is a fundamental principle for specification,
so we describe only those features of a module that are externally
observable;

(2) Sequences are a simple and powerful tool for specifying abstract objects;

(3) Explicit equations are preferable over implicit equations like those of
algebraic specifications;

(4) State machines are simple and powerful tools for specifying modules.

The fundamental difference between algebraic specification and the trace
assertion method is that algebraic specification supports implicit equations,
while the trace assertion method uses explicit equations only.

The areas of application for algebraic specifications are different
from those for the trace assertion method. Algebraic specification is better
suited for defining abstract data types in programming languages (such as
SML, LARCH, etc., see [15]). The trace assertion method is better suited
for specifying complex interface modules, for instance communication
protocols [5, 14]. A very wide bibliography concerning the Trace Assertion
Method can be found in [9].

2 Introductory Examples 

We shall consider the following simple modules: Queue, Drunk Queue,
Very Drunk Queue, Concurrent Queue and Concurrent Drunk Queue. The
Queue module provides four access programs: insert(i), which inserts an
integer i at the rear of the queue; remove, which takes no argument and
removes the first element of the queue; front, which takes no argument
and returns the value of the first element of the queue; and rear, which takes
no argument and returns the value of the last element of the queue.

Since a trace specification describes only those features of a module
that are externally observable, the question arises what an atomic
observation is. Following [9], we assume that an atomic observation is a pair
(access_program(arguments), value_returned), written as
access_program(arguments):value_returned. No argument and no returned
value is represented by nil; however, we also adopt a convention
of omitting nil, in particular as arguments. Hence, the Queue module
has the following atomic observations, called call-responses: insert(i):nil,
remove(nil):nil, front(nil):a, rear(nil):b, or, when the nil's are omitted:
insert(i), remove, front:a, rear:b, where a is the value of the first element
of the queue, and b is the value of the last element in the queue.

Intuitively, a state of the queue is determined by a finite sequence of
integers; the last element of the sequence represents the rear of the queue,
and the first represents the beginning of the queue. Note that every
sequence of properly used access programs leads to exactly one state. For
instance insert(4).insert(1).remove.insert(7) and insert(1).insert(7) both
lead to the state (1,7). They could be seen as equivalent, and we can
choose for instance the trace insert(1).insert(7) as a canonical trace
representing the state (1,7).
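The example can be replayed with a small Python sketch. The tuple encoding of calls and the names `run_trace` and `canonical_trace` are hypothetical choices made for illustration; only insert and remove are modeled because front and rear do not change the state.

```python
def run_trace(trace):
    """Apply a sequence of Queue calls and return the resulting state."""
    state = ()
    for call in trace:
        name = call[0]
        if name == "insert":
            state = state + (call[1],)  # append at the rear
        elif name == "remove":
            if not state:
                raise ValueError("remove called on an empty queue")
            state = state[1:]  # drop the first element
        else:
            raise ValueError(f"unknown call: {name}")
    return state


def canonical_trace(state):
    """Return the insert-only trace that represents a state."""
    return [("insert", value) for value in state]
```

Both traces from the text reach the state `(1, 7)`, and `canonical_trace((1, 7))` yields the insert-only trace chosen as its representative.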

Module Drunk Queue is the same as Queue except that the access program
remove behaves differently, namely: if the length of the queue is one,
it removes the first element; if the length is greater than one, it removes
either the first element or the first two elements of the queue. Now the
trace insert(4).insert(1).remove.insert(7) may lead to two states: (1,7) or
(7). However, each state is unambiguously described by an appropriate
trace built from insert calls.

The Very Drunk Queue has two "drunk" access programs: remove,
which works identically as in the case of Drunk Queue, and insert(i), which
enters an integer i either once or twice into the queue. In this case the
trace insert(1).insert(7) leads to (1,7), (1,1,7), (1,7,7), or (1,1,7,7).
The canonical traces, interpreted as traces that can unambiguously
describe states, cannot be defined in this case. The model presented in this
paper does not work for cases like the Very Drunk Queue. Such cases
have been extensively discussed in [9] (see an example of a Very Drunk
Stack) and the theory presented below can easily be extended to cover
them.
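The non-determinism of the two drunk modules can be made concrete by tracking the set of states a trace may lead to. The sketch below is illustrative: the function names and the `drunk_insert` flag are hypothetical, and a remove on an empty queue is simply assumed to produce no successor state.

```python
def drunk_step(states, call, drunk_insert=False):
    """Return every state reachable from `states` by one call."""
    next_states = set()
    for state in states:
        if call[0] == "insert":
            next_states.add(state + (call[1],))
            if drunk_insert:
                # Very Drunk Queue: the value may be entered twice.
                next_states.add(state + (call[1], call[1]))
        elif call[0] == "remove":
            if len(state) >= 1:
                next_states.add(state[1:])  # remove the first element
            if len(state) >= 2:
                next_states.add(state[2:])  # or the first two elements
        else:
            raise ValueError(f"unknown call: {call[0]}")
    return next_states


def drunk_run(trace, drunk_insert=False):
    """Return the set of states a trace may lead to, starting from empty."""
    states = {()}
    for call in trace:
        states = drunk_step(states, call, drunk_insert)
    return states
```

With `drunk_insert=False` this models the Drunk Queue, and with `drunk_insert=True` the Very Drunk Queue.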

The Concurrent Queue has the same access programs as Queue, but
simultaneous calls are allowed. For instance, if a queue is not empty, a
simultaneous call of insert(i) and remove is allowed, as well as a
simultaneous call of insert(i) and front, or remove and rear. Simultaneous calls
might be represented by steps like {insert(5), remove}, and it is more
convenient to use step-traces to represent the observations. For instance
the step-trace {insert(1)}.{insert(5), front:1}.{remove, rear:5} leads to
the state (5).
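The step-trace above can be replayed with the following sketch. It treats a step as acceptable only when every ordering of its call-responses is observable and leads to the same state; this is an assumption made for illustration, not the definition of enabled steps used later in the paper, and all names are hypothetical.

```python
from itertools import permutations


def apply_event(state, event):
    """Apply one call-response; return None if it cannot be observed."""
    name, value = event
    if name == "insert":
        return state + (value,)
    if name == "remove":
        return state[1:] if state else None
    if name == "front":
        return state if state and state[0] == value else None
    if name == "rear":
        return state if state and state[-1] == value else None
    raise ValueError(f"unknown access program: {name}")


def apply_step(state, step):
    """Apply a step by trying every ordering; all orderings must agree."""
    results = set()
    for ordering in permutations(step):
        current = state
        for event in ordering:
            current = apply_event(current, event)
            if current is None:
                break
        results.add(current)
    if len(results) != 1 or None in results:
        raise ValueError("step is not acceptable in this state")
    return results.pop()


def run_step_trace(step_trace):
    """Run a sequence of steps from the empty queue."""
    state = ()
    for step in step_trace:
        state = apply_step(state, step)
    return state
```

Here a call-response is a pair such as `("front", 1)` or `("remove", None)`, and a step is a list of such pairs.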

The Concurrent Drunk Queue has a "drunk" remove and allows
simultaneous calls.

3 The Model

3.1 Type of Concurrency 

We assume that executions (observations) of concurrent behaviours can
be fully modeled by step-sequences (or, equivalently, stratified posets).
This means we assume that simultaneity is observable and that, when restricted
to a single concurrent history, it is also transitive. We also assume that the
possibility of simultaneous execution of a and b implies the possibility of
execution in the order a followed by b, and in the order b followed by a (see
for instance [6] for a discussion of various models of concurrency). We are
fully aware of the restrictions imposed by the model we have chosen. Its
basic advantage is simplicity, together with the ability to model a wide
spectrum of systems. We hope the simplicity of the model will help current
industrial users of the Trace Assertion Method to adopt it quickly.

3.2 Alphabet 

What formally constitutes an alphabet from which the traces are built? 

Let $f$ be the name of an access program and let $input(f)$ and $output(f)$
be the sets of possible argument and result values. The signature $sig(f)$
is the triple:

$$sig(f) = (f, input(f), output(f)).$$

We assume that neither $input(f)$ nor $output(f)$ is empty by having
$nil \in input(f)$ and $nil \in output(f)$ as default. For example:
$sig(insert) = (insert, integer, \{nil\})$, $sig(remove) = (remove, \{nil\}, \{nil\})$,
$sig(front) = (front, \{nil\}, integer)$, $sig(rear) = (rear, \{nil\}, integer)$.

For a finite set $E$ of access program names, the signature $sig(E)$ is
the set of all signatures of $f \in E$:

$$sig(E) = \{ sig(f) \mid f \in E \}.$$

Given $sig(E)$, the call-response alphabet $A_E$ is the set of all possible triples,
written $f(x){:}y$, of access program names, arguments, and return values:

$$A_E = \{ f(x){:}y \mid f \in E,\ x \in input(f),\ y \in output(f) \}.$$

We adopt the convention of omitting nil in signatures. For example,
for the queue modules we have $E = \{insert, remove, front, rear\}$ and:

$$A_E = \{ insert(i) \mid i \in integer \} \cup \{ front{:}i \mid i \in integer \} \cup
\{ rear{:}i \mid i \in integer \} \cup \{ remove \}.$$

For a given set $E$ of access program names, we also define the call
alphabet $\Sigma_E$ and the response alphabet $O_E$:

$$\Sigma_E = \{ f(x) \mid f \in E,\ x \in input(f) \},$$

$$O_E = \{ d \mid \exists f \in E.\ d \in output(f) \}.$$
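The three alphabets can be enumerated for a small example. The sketch below replaces the infinite set of integers by a hypothetical finite range so that the sets can be listed; the dictionary `SIGNATURES` and the function names are illustrative assumptions.

```python
NIL = None
INTEGERS = range(3)  # hypothetical finite stand-in for "integer"

# name -> (input values, output values), following the queue signatures
SIGNATURES = {
    "insert": (INTEGERS, [NIL]),
    "remove": ([NIL], [NIL]),
    "front": ([NIL], INTEGERS),
    "rear": ([NIL], INTEGERS),
}


def call_response_alphabet(signatures):
    """All triples (f, x, y) with x in input(f) and y in output(f)."""
    return {
        (f, x, y)
        for f, (inputs, outputs) in signatures.items()
        for x in inputs
        for y in outputs
    }


def call_alphabet(signatures):
    """All pairs (f, x) with x in input(f)."""
    return {(f, x) for f, (inputs, _) in signatures.items() for x in inputs}


def response_alphabet(signatures):
    """All values d that occur in output(f) for some f."""
    return {d for _, outputs in signatures.values() for d in outputs}
```

With three integers, the call-response alphabet has ten elements: three for insert, one for remove, and three each for front and rear.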

Note that the sequences and step-sequences of call-response event
occurrences are what is really observed.

3.3 Trace Assertion Specification 

For every set $X$, let $S(X) = \{ A \mid \emptyset \neq A \subseteq X \wedge A \text{ is finite} \}$.
Elements of $S(X)$ will be called steps, while elements of $S(X)^*$ are called
step-sequences. For instance, if $X = \{a, b, c\}$, then
$\{a, b\}.\{b\}.\{a, b, c\} \in S(X)^*$ is a step-sequence. Traditionally $\lambda$ denotes
the empty step-sequence. If $X$ is a set of call-responses, i.e. $X = A_E$ for some
$E$, a step-sequence is called a step-trace.

For every set $X$, let $Rel(X) = \{ R \mid R \subseteq X \times X \}$, and for every
symmetric $R \in Rel(X)$, let $cliques(R) \subseteq S(X)$, the set of all cliques of
$R$, be the set defined as follows: for every $x \in X$, $\{x\} \in cliques(R)$, and
for every finite $A = \{x_1, \ldots, x_k\} \subseteq X$, $A \in cliques(R)$ iff
$(x_i, x_j) \in R$ for $i \neq j$.
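For a finite set, the steps and the cliques of a symmetric relation can be enumerated directly. The following sketch is illustrative only and uses hypothetical function names; it represents a step as a `frozenset` and a relation as a set of ordered pairs.

```python
from itertools import combinations


def steps(universe):
    """All non-empty finite subsets of a finite universe."""
    items = sorted(universe)
    return {
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    }


def cliques(universe, relation):
    """Steps whose distinct elements are pairwise related.

    `relation` is assumed to be symmetric; singletons always qualify.
    """
    return {
        step
        for step in steps(universe)
        if all((x, y) in relation for x in step for y in step if x != y)
    }
```

For the universe {a, b, c} there are seven steps, and with only a and b related the cliques are the three singletons and {a, b}.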

In principle a Trace Assertion Specification is an automaton with
call-response events as an alphabet (which might be infinite), and some sequences
of call-response events (traces) as states (which again might be infinite).
However, the automaton is finitely defined, in the sense that the number of
explicit equations that define the elements of the alphabet, the states and the
transition function is finite. For practical applications it is also important that
the number of these equations is small and that they are relatively simple. The
fact that the expressions are explicit (as opposed to algebraic specification,
where implicit equations are more natural) is extremely important from
the application viewpoint, even though it is not a very significant fact as
far as the theory is concerned (see [9] for details).

Formally a Concurrent Full Trace Assertion Specification is a tuple:

$$CFTA = (sig(E), C, \delta, \delta_c, K, enabled, t_0),$$

where:

• $E$ is the set of names of system calls, $|E| < \infty$,

• $sig(E)$ is the signature defined by $E$,

• $C \subseteq S(A_E)^*$ is the set of canonical step-traces (state descriptors),

• $\delta : C \times A_E \to 2^C$ is the sequential transition function, and
$\delta^* : C \times A_E^* \to 2^C$ is a standard extension of $\delta$ onto $A_E^*$ (see [4]).
In general an automaton that is a frame for CTA is non-deterministic, so the
range of $\delta$ is defined as $2^C$, see [9],

• $\delta_c : C \times S(A_E) \to 2^C$ is the concurrent transition function, and
$\delta_c^* : C \times S(A_E)^* \to 2^C$ is a standard extension of $\delta_c$ onto $S(A_E)^*$,

• $K \subseteq C \times \Sigma_E$ is a competence set; if $(c, a) \notin K$ then applying the
system call $a$ at the state $c$ is an erroneous/exceptional behaviour, like for
instance the remove call at an empty queue,

• $enabled : C \to 2^{S(A_E)}$ is the mapping that defines concurrency; it states
what steps are enabled at each state (canonical trace),

• $t_0 \in C$ is the initial (canonical) state,



and the following conditions are satisfied:

1. for all $c \in C$, $\delta_c^*(t_0, c) = \{c\}$,

2. for all $c \in C$, and all $S_1, S_2 \in S(A_E)$,
   $S_1 \subseteq S_2 \in enabled(c) \Rightarrow S_1 \in enabled(c)$,

3. for all $c \in C$, if $\{\alpha, \beta\} \in enabled(c)$, then $\delta^*(c, \alpha.\beta) = \delta^*(c, \beta.\alpha)$,

4. for all $c \in C$, and all $A \in S(A_E)$, if $A = \{\alpha_1, \ldots, \alpha_k\} \in enabled(c)$ then
   $\delta_c(c, A) = \delta^*(c, \alpha_1. \ldots .\alpha_k)$, otherwise $\delta_c(c, A) = \emptyset$,

5. for all $c \in C$ and all $a{:}d \in A_E$, if there exists $A \in enabled(c)$ such that
   $a{:}d \in A$ and $|A| \geq 2$ then $(c, a) \in K$,

6. for all $c \in C$ and all $a \in \Sigma_E$ there exists $d \in O_E$ such that $\delta(c, a{:}d) \neq \emptyset$,

7. for all $c \in C$ and all $\alpha \in A_E$, $\delta(c, \alpha) \neq \emptyset \iff \{\alpha\} \in enabled(c)$.



Condition (1) guarantees that the states are correctly and uniquely
defined by canonical traces. The second condition says that every non-empty
subset of an enabled step is also an enabled step at the given state
$c$. This means we do not enforce maximal concurrency (see [7]). Condition
(3) enforces the rule that simultaneous execution of $\{\alpha, \beta\}$ implies
the possibility of execution in the order $\alpha$ followed by $\beta$, and in the order $\beta$
followed by $\alpha$. Condition (4) defines the concurrent transition $\delta_c$ by
the sequential transition $\delta$. As a matter of fact, the concurrent transition
function $\delta_c$ is redundant, since it is fully described by $\delta$ and $enabled$;
however, it makes the theoretical considerations and definitions easier and more
readable. In concrete examples it is usually omitted (see the
example in Figure 1). The fifth condition states that concurrent activity
is restricted to normal, non-erroneous behaviour. Any exceptional activity
must be sequential. This follows from the suggestions of practitioners
who recommend not mixing concurrency with erroneous behaviour, since
the results might become difficult to handle. Condition (6) is based
on the observation that we cannot practically forbid using system calls
"illegally" (there is always a possibility that somebody will try to apply
remove to an empty queue), so the specification should be able to handle
such cases. The last condition states that $\delta$ and $enabled$ do not contradict
each other.
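Condition (2), the closure of enabled steps under non-empty subsets, is easy to check mechanically for a finite set of steps. The sketch below is an illustration with a hypothetical function name; steps are represented as `frozenset` objects.

```python
from itertools import combinations


def is_subset_closed(enabled):
    """Check condition (2): every non-empty proper subset of an enabled
    step is itself an enabled step."""
    for step in enabled:
        for size in range(1, len(step)):
            for subset in combinations(sorted(step), size):
                if frozenset(subset) not in enabled:
                    return False
    return True
```

A set containing {a}, {b} and {a, b} passes the check, while a set containing only {a, b} fails it.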

The functions $\delta$ and $\delta_c$ can be decomposed into $\delta^K$, $\delta^{err}$, $\delta_c^K$, $\delta_c^{err}$
as follows. For all $c \in C$ and all $a{:}d \in A_E$, we have:

$$\delta^K(c, a{:}d) = \begin{cases} \delta(c, a{:}d) & (c, a) \in K \\ \emptyset & (c, a) \notin K \end{cases}
\qquad \text{and} \qquad
\delta^{err}(c, a{:}d) = \begin{cases} \emptyset & (c, a) \in K \\ \delta(c, a{:}d) & (c, a) \notin K \end{cases}$$

The conditions (4) and (5) guarantee that
$\delta_c^K(c, \{\alpha_1, \ldots, \alpha_k\}) = (\delta^K)^*(c, \alpha_1. \ldots .\alpha_k)$, and if
$A \in enabled(c)$ and $\delta_c^{err}(c, A) \neq \emptyset$, then $A = \{\alpha\}$ is a singleton, and
$\delta_c^{err}(c, \{\alpha\}) = \delta^{err}(c, \alpha)$.

Lemma 1.
$\delta = \delta^K \cup \delta^{err}$ and $\delta^K \cap \delta^{err} = \emptyset$,
$\delta_c = \delta_c^K \cup \delta_c^{err}$ and $\delta_c^K \cap \delta_c^{err} = \emptyset$.

The functions $\delta^K$, $\delta_c^K$ are called the normal transition and normal
concurrent transition functions, while the function $\delta^{err}$ is called an exceptional
transition function. Due to condition (5) the function $\delta_c^{err}$ is of little
use. The concurrent full trace assertion specification CFTA restricted to
the functions $\delta^K$, $\delta_c^K$ is called a concurrent trace assertion specification, denoted
CTA, while CFTA restricted to $\delta^{err}$ is called an enhancement of CTA
and denoted ETA. Lemma 1 allows us to write (informally, but truly)
$CFTA = CTA \cup ETA$. For concrete examples, CTA (i.e. the functions
$\delta^K$, $\delta_c^K$) should be specified first, and an enhancement should be added
later. Lemma 1 and condition (7) guarantee that such an approach is
sound.

The enhancement ETA is called plain if $\delta^{err}(c, \alpha) \neq \emptyset$ implies that there are
$c_1$ and $\alpha_1$ such that $\delta^K(c, \alpha_1) \neq \emptyset$ and $\delta^K(c_1, \alpha) \neq \emptyset$. A non-plain
enhancement means that there are some special error recovery states and some
separate error recovery procedure ([9]). Our example in Figure 1 has a
plain enhancement.

We say that CFTA is deterministic iff for all $c \in C$ and all $\alpha \in A_E$,
$|\delta(c, \alpha)| \leq 1$. Note that this implies $|\delta_c(c, A)| \leq 1$ for every state $c$ and
step $A$. Of the examples introduced in Section 2, Queue and Concurrent
Queue are deterministic; the remaining ones are non-deterministic. The concept
of determinism defined above corresponds to the concept of determinism
used in automata theory (see [9]).

For a given CFTA, let $sim : C \to Rel(\{\alpha \mid \{\alpha\} \in enabled(c)\})$ be
defined as follows: $(\alpha, \beta) \in sim(c) \iff \{\alpha, \beta\} \in enabled(c)$.

For every $c$, $sim(c)$ defines the simultaneity relation at the state $c$.

Lemma 2. $A \in enabled(c) \iff A \in cliques(sim(c))$.
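Lemma 2 can be illustrated on a small enabled set. The sketch below is an assumption-laden example: it encodes call-responses as strings, restricts the integers to {0, 1}, and uses an enabled set modeled on a one-element queue holding the value 7; the function names are hypothetical.

```python
from itertools import combinations


def sim(enabled):
    """Simultaneity relation: pairs that form an enabled two-element step."""
    return {
        (x, y)
        for step in enabled
        if len(step) == 2
        for x in step
        for y in step
        if x != y
    }


def cliques(elements, relation):
    """All non-empty subsets whose distinct elements are pairwise related."""
    items = sorted(elements)
    found = set()
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            if all((x, y) in relation for x, y in combinations(combo, 2)):
                found.add(frozenset(combo))
    return found


# Hypothetical enabled set for a queue holding the single value 7.
singles = ["Remove", "Front:7", "Rear:7", "Insert(0)", "Insert(1)"]
enabled = {frozenset([event]) for event in singles}
for x in (0, 1):
    enabled.add(frozenset([f"Insert({x})", "Remove"]))
    enabled.add(frozenset([f"Insert({x})", "Front:7"]))
```

For this example `cliques(singles, sim(enabled))` returns exactly the nine steps in `enabled`, as the lemma states.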



From the above lemma and condition (3) of the CFTA definition,
it follows that we may equivalently define CFTA as
$CFTA = (sig(E), C, \delta, K, sim, t_0)$ with appropriate changes of the constraints
(1) - (6). Neither definition is better than the other. For the theory, the definition
with $\delta_c$ and $enabled$ seems to be better (see [10]). When specifying concrete
examples, $\delta_c$ is almost never explicitly specified; in some cases using
$enabled$ is better, in others $sim$ is better (compare [10]).

The constraints (1) - (6) have to be proven for every concrete example.
They are an essential part of a specification, the part which is frequently
called an obligation proof in software engineering. If the specification is
thoroughly thought out, those proofs are usually easy, but they may be
labour consuming if the specification itself is complex. The use of
automatic theorem provers such as PVS or IMPS is highly recommended [10].

3.4 Specification Format 

To be useful in practice, the trace assertion technique must provide a
reliable, readable and easy to use specification format. This issue is
completely irrelevant from the theoretical viewpoint, but very important if
the technique is going to be used outside academia. The details of a
specification format are to be found in [9, 10]. It makes heavy use of Tabular
Expressions (see [8, 13] for more details); for simple cases it appears to be
self-explanatory (see Figure 1).

The technique described above (slightly changed to fit on one page;
usually tabular expressions are also used to describe enabled) is illustrated
in Figure 1, which presents a Full Trace Assertion Specification for a
Concurrent Queue. The symbol "%" in the definition of $\delta$ indicates the
parts that define $\delta^{err}$, i.e. exceptional behaviour. Figure 1 provides only
the first part of a specification. The second one, "the obligation proof", is
not provided. It is relatively easy, but not so short, so it is omitted. An
interested reader is referred to [10].

4 Final Comment 

We extend the theory of [9] by allowing simple concurrency. The work 
can be further extended in several aspects. This is not a general model 
of concurrency since simultaneity here is transitive. For more complex 
non-sequential models a possible delay between a call and its response 
must be modeled. It would be interesting to see what the model looks
like for the general causality model or the "true-concurrency" model. Also,
the concept of refinement is not considered here. Extending this model to the
multi-object case also seems to be a challenge. One may have noticed that
proving that the concurrency law is obeyed for a more complex module is
very labor consuming but usually rather easy. A tool that can do it
automatically would be a great help.

Syntax of Access Programs

Name    Argument  Value    Action-response Form  Full Action-response Form
Front             integer  Front:d               Front:d
Rear              integer  Rear:d                Rear:d
Insert  integer            Insert(a)             Insert(a):nil
Remove                     Remove                Remove:nil

Canonical Step-traces

t is canonical $\iff$ t = $\lambda$ $\vee$ t = {Insert(a_1)}. ... .{Insert(a_k)}, where 1 $\leq$ k $\leq$ size,

t_0 = $\lambda$, i.e. the empty step-sequence.

Enabled

if c = $\lambda$ then enabled(c) = { {Insert(x)} | x is an integer },
if c = {Insert(a)}.t_1.{Insert(b)} $\wedge$ |c| = size then
    enabled(c) = { {Remove}, {Front:a}, {Rear:b}, {Rear:b, Remove} },
if c = {Insert(a)} then
    enabled(c) = { {Remove}, {Front:a}, {Rear:a} } $\cup$ { {Insert(x)} | x is an integer } $\cup$
    { {Insert(x), Remove} | x is an integer } $\cup$ { {Insert(x), Front:a} | x is an integer },
if c = {Insert(a)}.t_1.{Insert(b)} $\wedge$ |c| < size then
    enabled(c) =
    { {Remove}, {Front:a}, {Rear:b}, {Rear:b, Remove}, {Front:a, Rear:b} } $\cup$
    { {Insert(x)} | x is an integer } $\cup$ { {Insert(x), Remove} | x is an integer } $\cup$
    { {Insert(x), Front:a} | x is an integer }.

Trace Assertions

$\delta$(t, {Front:d})

Condition   Trace Patterns         Result
            t = {Insert(d)}.t_1    {t}
% d = nil   t = $\varepsilon$                  {$\lambda$}

$\delta$(t, {Rear:d})

Condition   Trace Patterns         Result
            t = t_1.{Insert(d)}    {t}
% d = nil   t = $\varepsilon$                  {$\lambda$}

$\delta$(t, {Insert(a)})

Condition             Result
length(t) < size      { t.{Insert(a)} }
% length(t) = size    {t}

$\delta$(t, {Remove})

Trace Patterns         Result
t = {Insert(b)}.t_1    {t_1}
% t = $\varepsilon$                {$\lambda$}

Dictionary

size : the size of the queue
length(t) : the length of the trace t

Fig. 1. Full Trace Assertion Specification for Concurrent Bounded Queue Module
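The trace assertions of Figure 1 can be read as a sequential transition function on canonical traces. The sketch below is one possible reading, not the authors' tool: a canonical trace is encoded as a tuple of the inserted integers, `SIZE` is a hypothetical bound, `None` stands for nil, and the branches marked as exceptional correspond to the rows prefixed with "%" in the figure.

```python
SIZE = 3  # hypothetical bound of the queue


def delta(trace, event):
    """Return the set of successor traces for one call-response event."""
    name, value = event
    if name == "Front":
        if trace and trace[0] == value:
            return {trace}
        if not trace and value is None:
            return {()}  # exceptional: Front:nil on the empty queue
        return set()
    if name == "Rear":
        if trace and trace[-1] == value:
            return {trace}
        if not trace and value is None:
            return {()}  # exceptional: Rear:nil on the empty queue
        return set()
    if name == "Insert":
        if len(trace) < SIZE:
            return {trace + (value,)}
        return {trace}  # exceptional: the full queue is left unchanged
    if name == "Remove":
        if trace:
            return {trace[1:]}
        return {()}  # exceptional: Remove on the empty queue
    raise ValueError(f"unknown access program: {name}")
```

Every call returns at most one successor, which matches the remark that the Concurrent Queue is deterministic.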

Abstract. The similarity assessment process often involves measuring the
similarity of objects X and Y in terms of the similarity of corresponding
constituents of X and Y, possibly in a recursive manner. This approach is not
useful when the verbatim value of the data is of less interest than what they can
potentially "do," or where the objects of interest have incomparable
representations. We consider the possibility that objects can have behavior
independent of their representation, and so two objects can look similar, but
behave differently, or look quite different and behave the same. This is of
practical use in fields such as Artificial Life and Automatic Code Generation,
where behavior is considered the ultimate determining factor. It is also useful
when comparing objects that are represented in different forms and are not
directly comparable. We propose to map behavior into data values as a
preprocessing step to Rough Set methods. These data values are then treated as
normal attributes in the similarity assessment process.



1 Introduction 

Data is usually considered simply the raw material to be processed. In this view, 
one receives the data, possibly from a database, looks at it, maybe modifies it and then 
returns it to a database if needed. In this view, there is a clear separation between the 
code that processes the data, and the data that is being processed. However, there are 
problems when "what the data can do," and not "the way they look," is of real interest. 

In this paper we propose allowing objects to have behavior, and show that this 
opens the door for Rough Set [7] techniques to be applied to fields such as Artificial 
Life [5]. The method suggested in the paper involves representing behavior as a single 
data value or a set of data values for input to standard Rough Set methods for 
classification and decision making. These are then treated as if they are constituent 
parts of the objects. This preprocessing step allows us to retain compatibility with 
traditional Rough Set methods. 

Assessing the similarity of two data sets, also commonly referred to as objects,
without necessarily having anything to do with Object Orientation principles, is
an important and common operation. Classification of objects is one example of the
usefulness of measuring similarity. A concept is expressed by a set of objects that
incorporate that concept. In the presence of uncertainty, Rough Set theory bounds this
target set $H$ by two sets, a lower approximation $\underline{H}$ and an upper approximation
$\overline{H}$, such that we have $\underline{H} \subseteq H \subseteq \overline{H}$. Rough Set theory has found
many practical applications. Similarity measures include graph measures of Semantic
Relatedness [3] for disambiguation of natural language expressions, correlation
measures [1] for calculating the relatedness between word pairs, and Information
Theoretic techniques [2] for measuring object associations.

When using Rough Sets to assess the similarity of two objects, researchers usually 
focus on the parts that make up those objects. Let $x = o(x_1, x_2, \ldots, x_n)$ denote an object $x$ 
constructed from sub-objects $x_1, x_2, \ldots, x_n$. The usual approach involves computing a 
function $f$ to measure the similarity of two objects $x$ and $y$ in terms of the similarity of 
their components, i.e., 
$\mathrm{similarity}(x, y) = f(\mathrm{similarity}(x_1, y_1), \mathrm{similarity}(x_2, y_2), \ldots, \mathrm{similarity}(x_n, y_n))$. 
This is a static approach. Because each part of an object can have 
behavior (like a function, which has a source code, and also a run-time behavior), we 
wish to include this behavior in the process too, essentially along the lines of Object 
Oriented programming. This is a dynamic approach. 
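
As a rough illustration of the static approach, the sketch below combines per-component similarities with a function f. The exact-match component measure and the use of the arithmetic mean for f are assumptions made only for this example; the paper does not prescribe either.

```python
from statistics import mean


def component_similarity(a, b):
    """Hypothetical component measure: 1.0 on an exact match, else 0.0."""
    return 1.0 if a == b else 0.0


def similarity(x, y, f=mean):
    """Static similarity: combine the per-component similarities with f."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of components")
    return f([component_similarity(a, b) for a, b in zip(x, y)])


# Two objects that agree on two of their three components.
score = similarity(("yes", "no", "small"), ("yes", "yes", "small"))
print(score)
```

Any other combining function (a weighted sum, a minimum) could be passed as `f` without changing the rest of the sketch.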

The rest of this paper is organized as follows. In section 2 Artificial Life is briefly 
introduced and the reader is told why classification methods that use the verbatim 
values of an object are not of much use there. An example of when behavior is of the 
ultimate importance is provided. Section 3 presents a guideline to measure behavior 
and translate it to a single value, or a set of values, which can then be used by 
ordinary Rough Set techniques, thus retaining compatibility with existing methods 
and application software. Section 4 concludes the paper. 

2 An Artificial Life Problem 

Artificial Life is concerned with the study of systems that behave as if they are 
alive. In most cases the systems are pieces of software, usually called creatures, that 
live in an artificial environment. Each creature can be considered a plan that when 
executed, affects its environment. A simulator can generate new creatures from 
scratch randomly, or by applying genetic operations of mutation and crossover to 
existing creatures. Rules of the environment are enforced on the creatures, and pre- 
defined fitness measures are used as a guide in creating the next generation. 
Thousands of generations are tried, and the creatures usually evolve to display certain 
characteristics that help them survive by conforming to the rules of the environment 
as much as possible. The rules determine the physics of the artificial world, and 
dictate how “normal” the creatures will behave when compared to the real world. 
Considering the random elements present in this process, it is no wonder that 
spontaneous emergence of behavior is one of the key characteristics observed in an 
Artificial Life environment. It is usually very hard to predict how the creatures will 
evolve. One usual behavior is that herds of creatures show up. Members of each herd 
have a lot of resemblance to each other, and differ substantially from members of 
other herds. Artificial Life techniques have been used to breed programs that perform 
useful functions [4]. 

In this paper, behavior is defined as the side effects of interpreting data. This 
interpretation is domain dependent, and can for example be the same as the execution 
of code produced by an automatic code generator. The definition of behavior can be 
generalized to include static data too. If there is an easily detectable relationship 
between the representational format of an object and the effects of its interpretation in 
the environment, then there is no need to interpret the object. If the data are not 
interpreted, then behavior is defined as their verbatim values. 

If the simulated environment is non-trivial then there is no direct correspondence 
between the source code of a creature and its behavior. The reason is that in a non- 
trivial system, on one hand there is more than one way to cause the same effects, and 
on the other hand executing seemingly similar, but not identical pieces of code can 
have very different results. In general, a behavior measurement procedure might have 
to be told to look for specific patterns of interest in certain locations of the system. 
This problem is greatly reduced when moving to object oriented programming, where 
global data is more contained and manageable, and completely disappears in a 
functional programming environment, where global state does not exist. 

Comparing behavior is of paramount importance in fields like Artificial Life. One 
concrete problem is the classification of creatures produced automatically. Because 
there are thousands of creatures at any time in an Artificial Life simulator, and their 
behavior may change from one generation to the next, it is very difficult to do the 
classification manually. One example problem is the classification of the creatures 
into hunters and non-hunters. Consider an imaginary artificial world, where plant food 
is created randomly by the simulator. The simulator ages the creatures at regular 
intervals, which makes them weaker. When they have passed a threshold of weakness, 
they die and are converted into plants. Creatures all start as peaceful vegetarians, but 
after a while some may begin developing the traits associated with hunting, like 
attacking others. This results in the attacked creatures becoming weaker, and thus 
dying sooner. Such behavior could develop simply because it may be rewarding for 
the creatures that display them: As the number of competing creatures reduces, there 
is more food to eat. After a while, they may learn that it is a good idea to hang around 
weak animals. Still another trait would be to attack weaker animals and then wait, 
which would probably be the most rewarding behavior. 

In this example, there is no explicit hunting behavior, because all the creatures do 
is eat plants. However, their behavior in the last case may very well be considered to 
closely resemble that of hunters. Behaviors like attacking other creatures (which 
makes them weak), waiting near old creatures (because they will die in a short time), 
and moving fast (to chase other creatures) are some of the condition attributes that can 
be used to help in the classification of the creatures into hunters and non-hunters. 
Another attribute, creature size, is of doubtful value, but can be considered if the 
expert looking at the simulator thinks there is a correlation between it and hunting. 
Using a Rough Set paradigm, one can come up with Table 1 for the creatures of this 
artificial world. Note that there are no variables anywhere in the simulator to tell us if 
a creature attacks others, or waits near old creatures, or moves fast, and the values 
should be extracted from the creatures’ behavior. 

           Condition Attributes                                               Decision Attribute
Creature   Attacks?   Waits near old creatures?   Moves fast?   Creature size   Hunter?
1          yes        yes                         yes           small           yes
2          yes        yes                         no            small           yes
3          no         no                          yes           big             no
4          no         yes                         yes           small           no
5          yes        no                          yes           big             yes
6          no         no                          no            small           no
7          no         no                          no            small           no
8          no         yes                         yes           small           yes

Table 1. Condition attributes used to determine if the given animals are hunters 

Table 1 uses intuitive notions about how a hunter should behave. For example, it is 
clear that animals 1 and 2 are smart hunters. They attack others, and wait near old 
(weak) animals, which increases their chance of finding food in a short time, as old 
animals die sooner. This does not necessarily mean that they hang around the same 
creature that they attacked, though. Animal number 5, on the other hand, is a stupid 
hunter because it does attack others, but being a fast mover, does not wait to use the 
results of its efforts. Animal 4 can be compared to a vulture. It does not attack others, 
but does wait near old animals. Animal 8 also acts like a vulture, but is classified as a 
hunter, which is counter-intuitive. This could be the result of an error on the part of 
the expert who did the classification. 

The above table gives the following indiscernibility classes: {1}, {2}, {3}, {4, 8}, 
{5}, {6, 7}. Following standard Rough Set techniques gives us the lower approximation 
$\underline{H}$ = {1, 2, 5} and the upper approximation $\overline{H}$ = {1, 2, 4, 5, 8}. 
If we change the value of creature size for creature 8 from small to big, 
then we get the following indiscernibility classes: {1}, {2}, {3}, {4}, {5}, {6, 7}, 
{8}. Deleting the size attribute gives the original indiscernibility classes. This hints 
that creature size is redundant. This is also intuitive, as in nature the physical size does 
not determine if an animal hunts others. 
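
The computation can be sketched in a few lines. The code below is only an illustration: it encodes the rows of Table 1 by hand, groups creatures that share the same condition attribute values, and derives the lower and upper approximations of the hunter set. The function and variable names are choices made for this sketch.

```python
from collections import defaultdict

# Table 1: creature -> (attacks, waits near old, moves fast, size, hunter)
TABLE = {
    1: ("yes", "yes", "yes", "small", "yes"),
    2: ("yes", "yes", "no", "small", "yes"),
    3: ("no", "no", "yes", "big", "no"),
    4: ("no", "yes", "yes", "small", "no"),
    5: ("yes", "no", "yes", "big", "yes"),
    6: ("no", "no", "no", "small", "no"),
    7: ("no", "no", "no", "small", "no"),
    8: ("no", "yes", "yes", "small", "yes"),
}


def indiscernibility_classes(table, attrs):
    """Group objects whose values agree on every attribute index in attrs."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[i] for i in attrs)].add(obj)
    return sorted(groups.values(), key=min)


def approximations(table, attrs, target):
    """Return (lower, upper) approximations of the target set."""
    lower, upper = set(), set()
    for cls in indiscernibility_classes(table, attrs):
        if cls <= target:      # class lies entirely inside the target
            lower |= cls
        if cls & target:       # class overlaps the target
            upper |= cls
    return lower, upper


hunters = {obj for obj, row in TABLE.items() if row[4] == "yes"}
classes = indiscernibility_classes(TABLE, attrs=(0, 1, 2, 3))
classes_without_size = indiscernibility_classes(TABLE, attrs=(0, 1, 2))
lower, upper = approximations(TABLE, (0, 1, 2, 3), hunters)
print(classes)
print(lower, upper)
```

Dropping the size column (index 3) leaves the classes unchanged for this table, which is the redundancy observation made in the text.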



3 Mapping Behavior 

The usual way of comparing two objects is to directly compare the values of their 
corresponding parts, and then use some statistical or heuristic function to come up 
with a measure of similarity. This method is used in many applications. Using a 
compatible way to measure behavior will enable us to continue to use the same 
programs and methods. This can be achieved by the introduction of a mapping 
function which takes interpretable data, and produces a value or a set of values. These 
can then be used in the similarity assessment procedure. In general the result of 
interpreting the data may depend on the global state, and the interpretation may 
change this state. More formally, we define a function $f$ such that: 

• $f(\varphi, \alpha) = \langle \varphi, \alpha \rangle$, where $\alpha$ is a data structure that is not interpreted and $\varphi$ is the 
global state. The global state does not change and the return value is $\alpha$ itself. 

• $f(\varphi, \beta) = \langle \varphi', \alpha' \rangle$, where $\beta$ is interpretable data, and $\alpha'$ is a measure of changes 
resulting from interpreting $\beta$. $\varphi$ is the starting state and $\varphi'$ is the resulting state. 

This ensures that the same method can be applied to objects with and without 
inherent behavior. It can also be applied when an object has behavior that is to be 
ignored for some reason, in which case $\beta$ is treated like $\alpha$. Mapping behavior back to 
the form of data values ($\alpha$ or $\alpha'$) makes it unnecessary to introduce new terms and 
techniques, and allows us to retain compatibility with existing methods. 

$f$ is domain dependent and should be defined by the experts of the domain. An 
example for automatic code generation in the functional programming paradigm is 
that $f$ is simply the result of executing the function $\beta$. For a neural network, $f$ 
provides the input and allows the network to produce its output. 

In Table 1, the value of a condition attribute such as "Attacks other creatures" is 
obtained by a function $f$, with $\varphi$ being the global state of the simulator at the 
time the function starts execution, and $\beta$ being the representation of the creature. The 
result is a possible change in the simulator (leading to the global state $\varphi'$), plus a 
return value from the set {yes, no}. Similar functions should be used to get 
the other condition attributes. 
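
A minimal sketch of such a mapping function follows, under stated assumptions: the global state is modelled as a dictionary, interpretable data is any Python callable that receives the state, and the `creature` function with its strength fields is a hypothetical stand-in for a simulator creature. None of these representations come from the paper.

```python
def f(state, data, interpret=True):
    """Map data, or the behavior of data, to a plain value.

    Returns (resulting_state, value). Data that is not interpreted is
    returned verbatim and the state is left unchanged. Interpretable data
    is run against a copy of the state, and its return value serves as
    the behavioral measurement.
    """
    if not interpret or not callable(data):
        return state, data
    new_state = dict(state)      # keep the starting state intact
    value = data(new_state)      # interpretation may change the state
    return new_state, value


def creature(state):
    """Hypothetical creature: attacks a neighbour weaker than itself."""
    if state["neighbour_strength"] < state["own_strength"]:
        state["neighbour_strength"] -= 1   # side effect of the attack
        return "yes"
    return "no"


phi = {"own_strength": 5, "neighbour_strength": 3}
phi_after, attacks = f(phi, creature)   # behavior mapped to "yes" or "no"
_, size = f(phi, "small")               # static data is returned as is
print(attacks, size, phi_after)
```

Passing `interpret=False` corresponds to the case where an object's behavior is deliberately ignored and the data is treated as a static value.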



4 Conclusion 

We have suggested taking a broader look at data in the similarity assessment 
process. We propose allowing data to have behavior, and using this behavior to 
measure the similarity of two objects. The main point to consider is that comparing 
parts of an object based solely on their data values may not reveal the complete 
picture. When the behavior of an object is more important than its representational 
format, then the data should be interpreted and the results should be included in the 
similarity assessment process. The conceptually simple technique of expressing 
behavior in terms of the results of its execution allows for the easy addition of 
behavior to existing similarity assessment systems. This makes it possible for 
standard, well-understood methods to be applied to domains such as Artificial Life, 
where systematic ways of comparison and classification are lacking. 



Abstract. Some organizations are commissioned to boost every kind of innovation 
by favoring collaboration between small businesses and scientific circles. 
Our research aims at developing a decision-aid tool to help intermediate organizations 
in their search for innovative enterprises and at determining if these enterprises 
are receptive to collaboration with a university. This tool consists 
of a set of decision rules by which the enterprises are selected. This 
set of rules was established with the rough set method. The problem we have to 
face comes under problematic P.α, and the originality of the paper is that we 
show the impact of the choice of decision rules on the type I error and thus 
on the percentage of objects incorrectly classified. 



1 Introduction 

In the context of the present international competition, the innovative nature of an 
enterprise is often a determining and not insignificant advantage. Unfortunately this 
innovative will is most of the time checked by a lack of human, material as well as fi- 
nancial means. Scientific circles - and more particularly universities - generally have 
these means. They are moreover willing to put them at the enterprises' disposal. The 
function of some intermediate organizations is initiating and developing collaborations 
between these two partners. These intermediate organizations cannot materially get in 
touch with each enterprise and it is important for them to be able to contact the enter- 
prises that are the most likely to develop any kind of collaboration. Our research fo- 
cuses on segmenting the market of firms to contact. From the characteristics of an 
enterprise, we have to be able to determine if this enterprise will be receptive to col- 
laboration. In order to carry out the segmentation of the enterprises, a decision-aid 
tool has been developed. This helping tool consists of a set of decision rules thanks 
to which the best enterprises are selected, namely ones the most able to develop a 
fruitful collaboration with a university. This set of rules was established with the 
rough set method. The application is given in Section 2. Section 3 constitutes 
the originality of this paper. Considering our problem as one of problematic P.α, 
we show the impact of the choice of decision rules on the type I error and thus on the 
percentage of objects incorrectly classified. 



2 Application of the Inductive Approach 

In this section, we analyze the problem of the selection of enterprises using the rough 
set method (an explanation of this method can be found in [1,2]). This study was done 
using ROSE software [3]. In our case study, the goal of the rough set method is to 
discover relationships between objects from the information system and pre-defined 
sorting rules. These relations express, in the form of “if... then...” decision rules, the 
expertise of the decision maker on the basis of sorting examples. The enterprises of our 
sample have thus been classified into 2 groups in order to constitute the decision 
attribute. Classification of new enterprises will be done on the basis of the decision 
rules generated. 

The first step of the analysis consists in the creation of the decision table. We pre- 
sent hereafter the condition attributes taken into account in order to explain the col- 
laboration potential of an enterprise. Our decision table includes 31 condition attrib- 
utes grouped into 4 different themes : 

- Geographical situation: localization (A1); 

- Branch of industry: branch of industry (A2), number of enterprises per sector (A3), 
evolution of the number of enterprises per sector (A4), number of workers per sector 
(A5), evolution of the number of workers per sector (A6); 

- Size: type of enterprise (A7), total of the balance sheet (A8), evolution of the total 
of the balance sheet (A9), average staff (A10), evolution of the average staff (A11); 

- Financial situation: corporate performance (profit or loss) (A12), evolution of the 
corporate performance (profit or loss) (A13), added value (A14), evolution of the 
added value (A15), cash-flow (A16), evolution of the cash-flow (A17), stockholder’s 
equity (A18), evolution of stockholder’s equity (A19), working capital 
(A20), evolution of the working capital (A21), added value per worker (A22), evolution 
of the added value per worker (A23), financial costs/VA (A24), evolution of the 
financial costs/VA (A25), output of long-lasting resources (A26), evolution of the 
output of long-lasting resources (A27), liquidity in the strict sense of the word 
(A28), evolution of the liquidity in the strict sense of the word (A29), own capital/total 
of liabilities (A30), evolution of stockholder’s equity/total of liabilities 
(A31). 

From the available database, 200 firms were sorted into two groups according to 
whether they could be considered, a posteriori, as enterprises worth contacting or not. 
It should be noticed that the research is based on small (and medium sized) businesses 
situated in the province of Hainaut in Belgium. 

The information table is constituted by a set of 200 enterprises described by 31 
condition attributes corresponding to the 31 selected criteria and by one decision 
attribute representing the sorting group of the firm. 

In order to exploit this table, it is necessary to discretize the raw data. On the basis 
of the raw data and different discretization factors, we are now able to build the decision 
table that will be used by ROSE. 

The rough set analysis constructs minimal subsets (reducts) of independent criteria 
ensuring the same quality of sorting as for the whole set of condition attributes. This 
leads us to the building of 266,088 different reducts. It was evidently impossible to 
test all of them to find the best one. So we chose to construct one reduct manually. 
We first inserted into the list of attributes constituting the reduct the attribute that 
alone ensures the best quality of sorting. We then proceeded by successively adding attributes 
to the list so as to maximize the quality of sorting of the combination of all 
attributes in this list, until we attained a final quality of sorting equal to one. This leads 
to the creation of 4 reducts. Several validation tests highlighted that the following one 
was the best choice: {A2, A3, A15, A16, A22, A24}. 
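
The manual construction described above resembles a greedy forward selection. The sketch below illustrates that idea on toy data; it is not the procedure implemented in ROSE. Quality of sorting is taken here to be the fraction of objects whose indiscernibility class contains a single decision value, and the function names are choices made for this sketch. A greedy pass of this kind reaches the target quality but does not guarantee a minimal subset.

```python
from collections import defaultdict


def quality_of_sorting(rows, decisions, attrs):
    """Fraction of objects lying in decision-consistent indiscernibility classes."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].append(i)
    consistent = sum(
        len(members)
        for members in groups.values()
        if len({decisions[i] for i in members}) == 1
    )
    return consistent / len(rows)


def greedy_attribute_subset(rows, decisions):
    """Add, one at a time, the attribute that most improves the quality."""
    all_attrs = list(range(len(rows[0])))
    target = quality_of_sorting(rows, decisions, all_attrs)
    chosen = []
    while quality_of_sorting(rows, decisions, chosen) < target:
        remaining = [a for a in all_attrs if a not in chosen]
        best = max(
            remaining,
            key=lambda a: quality_of_sorting(rows, decisions, chosen + [a]),
        )
        chosen.append(best)
    return chosen


# Toy decision table with three discretized condition attributes.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
decisions = ["good", "good", "bad", "bad"]
subset = greedy_attribute_subset(rows, decisions)
print(subset)
```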

This reduct was used to generate a minimal set of rules that cover all the objects 
from the reduced decision table. The number of rules generated is equal to 84. The 
interpretation of one of these rules is as follows: “IF the 'added value' has gone 
through one positive evolution followed by a negative one, with a higher level in 1995 
than in 1997, THEN the enterprise studied can be considered as worth being contacted. 
One can effectively suppose that an enterprise whose products are going out 
of fashion or whose profit margin is lower, would want, for example, to develop new 
products”. 

The major objective of the study is to use the sorting rules discovered from the decision 
table to support new sorting decisions. The 84 sorting rules generated 
thus have to be validated. A cross-validation test was carried out. We performed 4 validation 
tests on our sample of 200 enterprises. Table 1 contains the average percentages of 
objects correctly or incorrectly classified. 



Table 1. Validation test results 

                           Estimated belonging to the group of:
Effective belonging to:    Good prospects    Bad prospects
Good prospects             61,5% - [H]       9% - [M2]
Bad prospects              21% - [M1]        8,5% - [H]

Results of type H express correct classification. Results of type M express incorrect 
classification. M1 represents type I error and M2 type II error. In this case, the results 
are very satisfying since 70% of firms have been correctly classified. In order to reduce 
the type I error, we have taken into account the particularity of the problem we 
have to solve, which is a problem of problematic P.α. 
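
The figures quoted from Table 1 can be checked with a short calculation. The sketch below assumes, as the text does, that the type I error M1 is the share of bad prospects classified as good ones; the dictionary layout and variable names are choices made for this example.

```python
# Percentages of the 200 firms from Table 1, keyed by (effective, estimated).
table1 = {
    ("good", "good"): 61.5,   # H: correctly classified good prospects
    ("good", "bad"): 9.0,     # M2: type II error
    ("bad", "good"): 21.0,    # M1: type I error
    ("bad", "bad"): 8.5,      # H: correctly classified bad prospects
}

correct = table1[("good", "good")] + table1[("bad", "bad")]
estimated_good = table1[("good", "good")] + table1[("bad", "good")]
type_i = table1[("bad", "good")]

# Share of the firms labelled 'Good prospects' that are actually bad ones.
type_i_within_selected = 100 * type_i / estimated_good

print(correct, estimated_good, round(type_i_within_selected, 1))
```

The last quantity relates the type I error to the selected firms only, which is the kind of ratio a decision maker contacting only the selected enterprises would care about.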



4 Problematic P.α 



The problem we have to face comes under problematic P.α [4]. Indeed, our objective is 
to choose, within a database, a non-ordered subset of enterprises likely to develop a 
fruitful collaboration with a university. As we are only interested in the ‘best’ firms, we 
will use only the decision rules leading to the conclusion ‘Good Prospects’. From the 
database which enables us to generate the decision rules and carry out a validation test 
(see Table 1), we can observe that 82,5% of firms have been classified as good prospects. 
However, we can also note that the type I error is about 21%. 

It is clear that when the decision maker applies the decision rules to a new database, 
all the enterprises selected cannot be contacted at the same time. The idea is thus to 
evaluate the impact of the choice of rules on the selection of enterprises and consequently 
to identify the influence of the choice of rules on the percentage of correctly 
classified enterprises. 

For the initial analysis of our problem by the rough set method, the algorithm used 
to generate the decision rules was an algorithm inducing the minimal set of rules 
covering the entire database. Now, we propose to generate the rules from an algorithm 
inducing a satisfactory set of rules. This category of algorithms gives as a result the 
set of decision rules which satisfy given a priori user requirements. The measure 
of quality we choose to use is the strength of the rules. We present in Table 
2 the main results obtained. The meaning and the interpretation of each column of the 
table are the following: 

1. This column indicates the strength of the rules. It specifies the minimum percentage 
of objects from the database the rule has to cover to be generated; 

2. This column points out the total number of rules generated as a function of the 
minimum percentage of objects to be covered by each rule. For example, if each rule 
has to cover at least 1% of objects from the database, the total number of decision 
rules generated is 726. As we could expect, we notice that the total number of rules 
generated decreases depending on the required strength of the rules; 

3. Identically to column (2), this one gives us the number of decision rules leading to 
the conclusion ‘Good Prospect’; 

4. Similarly to the way it was calculated in the previous section, this column gives us 
the percentages corresponding to type I error (wrong classification in the set of 
good prospects). For example, if during the validation tests we use only decision 
rules covering at least 3% of objects from the database, we can notice that 20,7% of 
the total number of enterprises have been incorrectly classified into the set of 
‘Good Prospects’. It is interesting to observe that the stronger the rules are, the 
lower the percentage of error; 

5. The values included in this column indicate the number of firms classified into the 
set of ‘Good Prospects’ using the rules from column (3). As expected, this number 
decreases continuously. The stronger the rules have to be, the smaller the number 
of classified firms. This is quite normal because, using fewer decision 
rules, more and more enterprises cannot be classified any more, whether in the set 
of ‘Good Prospects’ or in the set of ‘Bad Prospects’; 

6. Finally, this column gives us the percentage of objects incorrectly classified into 
the set of ‘Good Prospects’. For example, if we use rules covering at least 1% of the 
objects from the database, the percentage of incorrectly classified objects relative 
to the total number of objects classified in the set ‘Good Prospects’ (177 - see (5)) 
is equal to 28%. All the elements of this column allow us to make an interesting 
observation: the stronger the generated decision rules are, the lower the percentage 
of incorrectly classified firms. This means that using the strongest rules, the decision 
maker will be able to select a subset of enterprises, smaller certainly, but 
increasing his chances of selecting the best ones and thus decreasing the risk of a 
bad classification. Thus, if we use decision rules covering 11% of enterprises from 
the database, 22 firms are classified as ‘Good Prospects’ with no risk of error. 



Table 2. Main results of the analysis 

(1)    (2)    (3)    (4)     (5)    (6)
1      726    317    24.8    177    28
2      240    165    23.8    171    27
3      110    35     20.7    139    26.5
4      42     19     15.6    86     21
5      27     4      11.3    64     18
6      10     4      10.9    53     14
7      3      1      9       33     12
8      3      1      8       32     9.5
9      1      1      3.3     25     4
10     1      1      3.3     25     4
11     1      1      0       22     0


On the basis of the results obtained, one recommendation for the decision maker will 
be to first use the strongest decision rules in order to create a first subset of firms and 
to work on it until it is exhausted and then enlarge the set of rules successively in 
descending order of strength of the rules. 
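
The recommendation can be sketched as a loop over rule strengths. The code below is only an illustration of the ordering idea: rules are modelled as (strength, predicate) pairs, and the attribute values, firm identifiers, and function name are hypothetical and not taken from the study.

```python
def select_in_strength_order(rules, firms):
    """Yield (strength, newly selected firm ids), strongest rules first.

    rules: list of (strength_percent, predicate) pairs whose conclusion is
    'Good Prospects'; firms: dict mapping a firm id to its attribute dict.
    """
    selected = set()
    for strength in sorted({s for s, _ in rules}, reverse=True):
        batch = {
            firm_id
            for firm_id, attrs in firms.items()
            if firm_id not in selected
            and any(pred(attrs) for s, pred in rules if s == strength)
        }
        selected |= batch
        yield strength, sorted(batch)


# Hypothetical rules and firms, for illustration only.
rules = [
    (3, lambda a: a["A2"] == "metal"),
    (11, lambda a: a["A15"] == "up-down"),
]
firms = {
    "f1": {"A15": "up-down", "A2": "metal"},
    "f2": {"A15": "flat", "A2": "metal"},
    "f3": {"A15": "flat", "A2": "food"},
}
batches = list(select_in_strength_order(rules, firms))
print(batches)
```

Each batch is the subset of firms to work on before the decision maker moves on to the next, weaker group of rules.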



5 Conclusion 

The aim of this paper was the building of a decision model constituting a decision aid 
tool allowing intermediate organizations to optimize the collaboration possibilities 
between enterprises and universities. This model is represented by a set of decision 
rules allowing the selection of the enterprises most likely to develop a fruitful collabo- 
ration with a university. Decision rules were generated using rough set analysis. The 
rough set method was used first to find the minimal subset of attributes ensuring an 
optimal quality of sorting and second to generate a set of decision rules. 



Abstract. This paper reports on the design and implementation of a timed Petri 
net interpreter. Currently, several Petri net simulators written in the Pascal and 
C languages are available. However, our approach is to use an expert system 
language called CLIPS to write an interpreter to execute Petri nets. The major 
difference between a rule-based expert system language like CLIPS and 
languages such as Ada, C, or Pascal is that the rules of CLIPS can be activated 
concurrently, while the statements of other languages are sequential. In this 
project, we first design a Petri net language; programs written in a Petri net 
language can describe Petri net behavior. Then, we will design and write an 
interpreter in the CLIPS language that can execute Petri net programs. The 
CLIPS language is a data driven language, and the interpreter can search for 
enabled transitions for firing. With this approach, we can avoid complicated 
data structures and their implementations. 



1 Introduction 

Simulation analysis is an effective approach to solving problems, because it can obtain 
the desired information without much cost or the inconvenience of manipulating the 
real world system. The definition of simulation given by S. V. Hoover and R. F. Perry 
[11] indicates, “simulation is a process of designing a mathematical or logical model 
of a real system and then conducting computer-based experiments with the model to 
describe, explain, and predict the behavior of the real system.” From this definition, 
we see that a good model is an essential element for any simulation analysis. 
Simulators are frequently used to model systems that involve concurrent activities. 
Some simulation languages provide special tools for modeling system activities. 
GPSS has its own flowchart presentation, SLAM uses its network graphs, and SIMON 
uses activity diagrams. One important drawback of these tools is their lack of 
generality [26, 27]. 

A Petri net is a general formal modeling tool that can be used to describe and 
analyze the flow of information in a system. It is a powerful modeling tool 
particularly for the representation of asynchronous, and concurrent activities [2, 4, 7, 
15, 21, 24]. By using Petri net models, one can gain several advantages over other 
models. The basic principle is easy to understand, the ability to extend basic Petri net 
models is quite flexible, and they can describe the dynamic behavior of systems. The 
design of a Petri net interpreter using the CLIPS language [10] is a new approach to 
simulation. Several Petri net simulators, written in Pascal and C languages, already 
exist [4, 8, 14, 22] so a natural question to ask is, “why another one?” The answer, and 
the reason for the development of the interpreter, is that none of the existing 
simulators can exactly model Petri nets. The interpreter developed here comes closer 
to this goal. 

A Petri net is an asynchronous, concurrent, and token driven system. Since 
procedural programming languages are statement-oriented and executed 
sequentially, any Petri net simulator written in a procedural programming language 
can hardly simulate such a system precisely. Our approach is totally different. In this 
research, the CLIPS programming language was chosen to implement the interpreter 
because CLIPS is one of the production based programming languages consisting of 
rules and a database against which those rules are compared. The CLIPS language is a 
concurrent and data driven language that can describe Petri net behavior. The use of 
the CLIPS language not only eases the design of the interpreter, but also helps in the 
design of the Petri net language (PNL). 

In this paper, we will describe the design of a PNL which can represent Petri net 
models. Then, we develop an interpreter that will be used to execute the programs 
written in the PNL. The results from the output of the interpreter will supply 
information on the system performance. This approach will provide an alternative for 
simulation. The concept of using a Petri net interpreter for simulation is given below: 

System --(Modeling)--> Petri Net --(Developing)--> Program in PNL --(Input)--> Interpreter --(Output)--> Result 

Fig. 1.1 Petri net interpreter, an alternative method of simulation. 



2 Petri Nets 

The theory of Petri nets was first developed by Carl Adam Petri in 1962 to model 
those systems with interacting concurrent components [23]. Due to their natural 
representation, Petri nets have been adopted for modeling a wide collection of systems. 
For more detailed descriptions see [3, 5, 16, 18, 20, 25]. The original Petri 
net only allowed a single arc between places and transitions. A transition can fire if 
the input places contain tokens [23]. The theory of Petri nets has been further 
developed by different researchers with various motivations; therefore, a number of 
variant theories about Petri nets have appeared and continue to grow. In this paper, we 
adopt the formal definition of Petri nets given by Peterson, together with his rules [23]. 



2.1 Petri Net Graphs 

A Petri net graph is a directed symbolic graph that contains places, tokens, transitions, 
and arcs [23]. A place is denoted by a circle. A dot in a circle means this place 
contains a token. A Petri net with tokens is called a marked Petri net. The occurrence 
of tokens indicates the state of the net. A transition is denoted by a bar which controls 
the flow of tokens between places. These places and transitions are connected by 
directed arcs. If an arc leads from a place to a transition, then this place is an input of 
the transition. If an arc leads to a place from a transition, then this place is an output 
of the transition. A Petri net executes by firing transitions: a firing transition removes 
tokens from its input places and deposits tokens in its output places. 
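
The firing rule described above can be illustrated with a short Python sketch. It is 
not part of the original interpreter (which is written in CLIPS). The function names 
is_enabled and fire, and the representation of a marking as a mapping from place 
number to token count, are assumptions made for illustration only. 

```python
from collections import Counter


def is_enabled(marking, inputs):
    """A transition is enabled if every input place holds enough tokens."""
    need = Counter(inputs)
    return all(marking.get(place, 0) >= count for place, count in need.items())


def fire(marking, inputs, outputs):
    """Return the marking reached by firing a transition once (untimed)."""
    if not is_enabled(marking, inputs):
        raise ValueError("transition is not enabled")
    new_marking = Counter(marking)
    new_marking.subtract(Counter(inputs))   # remove tokens from input places
    new_marking.update(Counter(outputs))    # deposit tokens in output places
    return +new_marking                     # drop places left with zero tokens


# Places 1 and 2 each hold one token; the transition moves them to place 3.
after = fire({1: 1, 2: 1}, inputs=[1, 2], outputs=[3])
```

Returning a new marking rather than mutating the old one keeps each state of the net 
available for inspection, which is convenient when tracing an execution by hand. 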

2.2 Time and Timed Petri Nets 

The two main techniques for handling time within Petri nets [9, 12, 16, 19] are 
Ramchandani’s timed Petri nets or Holliday’s generalized timed Petri nets [12], and 
Merlin’s time Petri nets [1]. A timed Petri net represents time by attaching to each 
transition a finite duration, called a deterministic firing time. A time Petri net uses 
two real numbers to form an interval, a minimum and a maximum time, associated with 
each transition. Time can be attached to transitions, to places, or to both. This 
project implements timed Petri nets. If all inputs of a transition contain 
sufficient tokens, then the transition removes the input tokens immediately and 
deposits the output tokens after the processing time. 



3 Design of a Petri Net Language 

In this research project, we have developed a Petri net language (PNL). Programs 
written in the PNL have a one-to-one correspondence with Petri net graphs. Once 
we have defined the Petri net graph for a given system, we can convert the graph into 
a program. A program written in PNL consists of two files, a structure file and an 
initial data file. The structure file expresses the transition structure within the Petri net, 
while the initial data file indicates the initial marking in that net. We can evaluate the 
performance for distinct initial markings by using different initial data files. The syntax 
of the structure file and the initial data file is shown below: 



1. Structure file: 

<structure file>    ::= <transition list> 
<transition list>   ::= <transit item> | <transit item> <transition list> 
<transit item>      ::= TRANSITION <transition number> TIME <time> 
                        INPUT <input> OUTPUT <output> 
<input>             ::= <place list> 
<output>            ::= <place list> 
<place list>        ::= <place item> | <place item> <place list> 
<place item>        ::= <place number> 
<place number>      ::= [POSITIVE INTEGER] 
<transition number> ::= [POSITIVE INTEGER] 
<time>              ::= [POSITIVE INTEGER] 

2. Initial data file: 

<initial data file> ::= <place list> 
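
To make the grammar concrete, the following Python sketch parses both files. It is an 
illustration only, not the authors' CLIPS implementation. It assumes that tokens are 
separated by whitespace and that a place number repeated in the initial data file 
stands for several tokens in that place. The names Transition, parse_structure and 
parse_initial are hypothetical. 

```python
from dataclasses import dataclass


@dataclass
class Transition:
    number: int
    time: int
    inputs: list
    outputs: list


def parse_structure(text):
    """Parse a structure file into a list of Transition records."""
    tokens = text.split()
    pos = 0

    def expect(keyword):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != keyword:
            raise ValueError(f"expected {keyword} at token {pos}")
        pos += 1

    def positive_integer():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdecimal() or int(tokens[pos]) < 1:
            raise ValueError(f"expected a positive integer at token {pos}")
        pos += 1
        return int(tokens[pos - 1])

    def place_list():
        places = [positive_integer()]          # at least one place item
        while pos < len(tokens) and tokens[pos].isdecimal():
            places.append(positive_integer())
        return places

    transitions = []
    while True:                                # at least one transit item
        expect("TRANSITION")
        number = positive_integer()
        expect("TIME")
        time = positive_integer()
        expect("INPUT")
        inputs = place_list()
        expect("OUTPUT")
        outputs = place_list()
        transitions.append(Transition(number, time, inputs, outputs))
        if pos == len(tokens):
            return transitions


def parse_initial(text):
    """Parse an initial data file: a non-empty list of place numbers."""
    places = [int(token) for token in text.split()]
    if not places or any(place < 1 for place in places):
        raise ValueError("expected a non-empty list of positive integers")
    return places


net = parse_structure("TRANSITION 1 TIME 2 INPUT 1 2 OUTPUT 3\n"
                      "TRANSITION 2 TIME 1 INPUT 3 OUTPUT 1")
```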
4 The CLIPS Language 

The C Language Integrated Production System (CLIPS) is an expert system language 
developed by the Lyndon B. Johnson Space Center [10]. The basic elements of 
CLIPS are: 

1. Fact-list: global memory for data 

2. Knowledge-base: contains all the productions, or rules 

3. Inference engine: controls overall production execution 

CLIPS is a data-driven language. Its programs consist of facts and rules. The facts are 
the data required for execution, and the job of the inference engine is to determine 
which rules should be executed. Like OPS5 [6], CLIPS is called a production 
language or a rule-based language. The main difference between a rule-based 
language and an imperative language such as Ada, C, FORTRAN, PL/1, or Pascal is 
that programs written in a rule-based language are data-driven programs, so programs 
cannot be executed without facts. Another difference is that the rules execute in 
parallel, while the other languages are sequential in nature [10]. 

4.1 Fact-List 

In CLIPS, the assert command is used to put data in the fact-list; the retract 
command is used to remove data from the fact-list. A fact contains one or more fields 
enclosed in parentheses. 

To assert two facts: 

(assert (This is a test.)) 
(assert (Any more tests?)) 

These two facts are then placed in the fact-list as below: 

f-1 (This is a test.) 
f-2 (Any more tests?) 

To retract a fact we need to specify the fact index of the fact. For example, 
(retract 1) will remove the fact (This is a test.) from the fact-list. 
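
For readers unfamiliar with CLIPS, the behaviour of assert and retract can be 
mimicked by a small Python model. This is only a sketch of the idea, not of CLIPS 
itself: the class FactList and its methods are hypothetical, fact indices start at 1 to 
follow the listing above, and the allow_duplicates flag is an assumption modelled on 
the idea that identical facts are stored once unless duplication is switched on. 

```python
class FactList:
    """A toy model of a fact-list with indexed facts."""

    def __init__(self, allow_duplicates=False):
        self.allow_duplicates = allow_duplicates
        self._facts = {}          # fact index -> tuple of fields
        self._next_index = 1

    def assert_fact(self, *fields):
        """Add a fact and return its index, or None if it is a rejected duplicate."""
        fact = tuple(fields)
        if not self.allow_duplicates and fact in self._facts.values():
            return None
        index = self._next_index
        self._facts[index] = fact
        self._next_index += 1
        return index

    def retract(self, index):
        """Remove the fact with the given index and return it."""
        return self._facts.pop(index)

    def facts(self):
        return dict(self._facts)


fact_list = FactList()
fact_list.assert_fact("This", "is", "a", "test.")
fact_list.assert_fact("Any", "more", "tests?")
fact_list.retract(1)      # only the second fact remains
```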

4.2 Knowledge Base 

The structure of a rule is similar to an if then statement in a procedural language. The 
format of a rule is: 

(defrule rule_name 
   (condition_1)    ; pattern 1 
   (condition_2)    ; pattern 2 
   . . .            ; left-hand side (LHS) of the rule 
   (condition_n) 
   => 
   (action_1) 
   (action_2) 
   . . .            ; right-hand side (RHS) of the rule 
   (action_m)) 
CLIPS compares the LHS of each rule against the facts in the fact-list. If all the 
conditions of a rule match the facts, the rule is allowed to invoke the actions in the RHS. 

Example: 

(defrule animal 
   ?fact-1 <- (cat)    ; ?fact-1 is bound to the fact (cat) 
   ?fact-2 <- (dog)    ; ?fact-2 is bound to the fact (dog) 
   => 
   (retract ?fact-1 ?fact-2) 
   (assert (animals are cats and dogs))) 

This short program does the following: if there is a “cat” and a “dog” in the 
fact-list, then it removes “cat” and “dog” from the fact-list and asserts “animals are 
cats and dogs” in the fact-list. 



5 The Interpreter 

The interpreter is written in the CLIPS programming language. The algorithm for the 
interpreter is given as follows: 

1. Read the user’s program: the interpreter reads the user’s program, written in the 
Petri net language, including a structure file and an initial data file. Both files are 
read line by line, and each line is asserted into the fact-list until the end of the 
file. 

2. Ask the user to choose between a time-oriented or place-oriented simulation: 
This information determines the way the program terminates. A time-oriented 
simulation executes the program for a given amount of time. A place-oriented 
simulation executes the program while monitoring the number of tokens 
inserted into a certain place, and terminates when a fixed number is reached. 

3. Set and maintain a global clock: First set the clock to zero and assert this into the 
fact-list. When no rules match, the global clock increases by one. 

4. Add processing time and current time: the interpreter adds the processing time 
of each transition to the current time to obtain the firing time of that transition, and 
then inserts this information into the fact-list. 

5. Check all transitions that are enabled to fire in the fact-list: If there are enabled 
transitions (that is, all input tokens of a transition are in the fact-list), go to step 7. 

6. Check unenabled transitions: At step 4, the interpreter calculates the firing time 
for all transitions, but some transitions are not enabled at step 5; therefore we 
need to remove that firing time information from the fact-list. 

7. Retract input tokens of enabled transitions from the fact-list: When a transition is 
enabled, it removes its input tokens from the fact-list immediately. 

8. Add output tokens of firing transitions into the fact-list: If there are firing times that 
are the same as the current time, the interpreter inserts the output tokens into the fact-list. 

9. Repeat step 3 to step 8 until the end of execution: For a time-oriented simulation, 
the interpreter checks the global clock to terminate. For a place-oriented 
simulation, the interpreter decreases the number of tokens, given by the user, by 
one whenever a token is inserted into the monitored place, until the number becomes zero. 
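
The core of steps 3 to 9 can be sketched in Python for the time-oriented case. This is 
a simplified illustration under stated assumptions, not a port of the CLIPS 
interpreter: the clock advances by one unit per iteration, a transition may start a new 
firing while an earlier firing of the same transition is still in progress, and the 
function name simulate and its tuple-based input format are hypothetical. 

```python
from collections import Counter


def simulate(transitions, initial_places, duration):
    """Time-oriented simulation of a timed Petri net.

    transitions: list of (number, time, inputs, outputs) tuples
    initial_places: list of place numbers, one entry per initial token
    Returns the firing log [(start time, transition number)] and the final marking.
    """
    for number, time, inputs, _ in transitions:
        if time < 1 or not inputs:
            raise ValueError(f"transition {number} needs a positive time and an input place")

    marking = Counter(initial_places)
    pending = []                      # (finish time, output places) of firings in progress
    log = []

    for clock in range(duration + 1):
        # Step 8: deposit output tokens whose firing time equals the current time.
        still_pending = []
        for finish, outputs in pending:
            if finish == clock:
                marking.update(outputs)
            else:
                still_pending.append((finish, outputs))
        pending = still_pending

        # Steps 4 to 7: start every enabled transition, removing its input tokens at once.
        fired = True
        while fired:
            fired = False
            for number, time, inputs, outputs in transitions:
                need = Counter(inputs)
                if all(marking[place] >= count for place, count in need.items()):
                    marking.subtract(need)
                    pending.append((clock + time, outputs))
                    log.append((clock, number))
                    fired = True
    return log, +marking


log, final = simulate([(1, 2, [1], [2]), (2, 1, [2], [3])], [1], duration=5)
```

A place-oriented run would replace the fixed range of clock values by a loop that 
stops once the monitored place has received the requested number of tokens. 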



6 Macintosh Interface for CLIPS 

This interpreter is written in CLIPS 5.0 and runs on an Apple Macintosh computer. This 
section provides a brief discussion of some menu commands needed to run the user’s 
programs. Commands can be entered in the dialogue window. 

6.1 The File Menu 

The CLIPS File menu includes the following commands: 

New: This command opens a window named “untitled” for editing. 

Open: This command allows the user to open a text file for editing. 

Load: This command allows the user to load a file into the knowledge base. 

Save: This command saves the file in the active edit window. 

Save as: This command allows the file in the active edit window to be saved 
under a new name. 

Quit: This command exits CLIPS. 

6.2 The Execution Menu 

The CLIPS Execution menu includes the following commands: 

Reset: This command is needed before each run so that it can initialize the 
fact-list. 

Run: This command executes programs. 

Option: Under this window, we need to choose “Fact Duplication”, because it is 
possible to have multiple tokens in a node (place). 

Once we are in the dialogue window, we first choose “Fact Duplication” under the 
Execution menu. This allows multiple tokens in the same place. We then load the 
interpreter into the knowledge base. Before each run, it is necessary to reset the 
fact-list. 



7 Execution and Output 



There are two ways to execute a program written in the Petri net language using the 
interpreter developed in this project: by execution time or by monitored place. The 
formats for both methods are given below: 
By Execution Time: 

Please enter the name of structure file: 
Execution is by time or by node? 
(time/node) 
Enter the duration of execution: 
Enter the name of initial file: 
Enter the input transition(s): 
Enter the output transition(s): 



By Monitored Place: 

Please enter the name of structure file: 
Execution is by time or by node? 
(time/node) 
Enter duration of execution: 
Enter node to be traced: 
Enter the name of initial file: 
Enter the input transition(s): 



The name of structure file is the structure file of the Petri net. The name of initial 
file is the data file of initial tokens. If the user chooses to execute by time, the 
duration of execution is the total amount of time allocated; otherwise, execution 
by place will run until the given number of tokens has been inserted into that 
place. Both methods, by time and by node, produce the same kind of output. The output 
from the interpreter provides some important information about the system 
performance. The output information includes the following three lists: 

1. A list of times and firing transitions: this list provides the time when each 
transition is fired. From this list, one can obtain the percentage of system idle 
time and the percentage of system busy time. 

2. A list of transitions, total number of firings, and average time between firings: this 
list tells how busy the system is for each transition. From this list, one can 
determine the bottleneck of the system. Also, one can obtain the busy time and 
idle time for each transition. 

3. A summary list that includes total time, number of input jobs, number of finished 
jobs, number of unfinished jobs, and average time to complete a job (the 
throughput). 
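
As an illustration of how the second list could be derived from the first, the 
following Python sketch computes, for each transition, the total number of firings and 
the average time between consecutive firings. The input format (a list of pairs of 
time and transition number) and the name firing_statistics are assumptions made for 
this example and do not describe the interpreter's actual output format. 

```python
from collections import defaultdict


def firing_statistics(log):
    """Map each transition to (number of firings, average time between firings)."""
    times = defaultdict(list)
    for time, transition in log:
        times[transition].append(time)

    stats = {}
    for transition, fired_at in times.items():
        fired_at.sort()
        gaps = [later - earlier for earlier, later in zip(fired_at, fired_at[1:])]
        # A transition that fired only once has no gap to average.
        average_gap = sum(gaps) / len(gaps) if gaps else None
        stats[transition] = (len(fired_at), average_gap)
    return stats


stats = firing_statistics([(0, 1), (2, 2), (4, 1), (10, 1)])
```

A transition with many firings and a short average gap is a candidate for the busiest 
part of the system. 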



8 Conclusion and Future Research 

In this research project, we have developed a Petri net language that is based on our 
observation of a variety of systems and their timed Petri net models. The language is 
simple and small. It is easy to understand and easy to program. The Petri net 
language we have developed exactly characterizes the dynamic activities of Petri nets. 
Therefore, any system that can be modeled by a timed Petri net as described in this paper 
can be directly implemented by a program in the Petri net language. 

In simulation, the most difficult process is model validation and model verification. 
A Petri net is a dynamic system that models a system directly; the execution of Petri 
nets will perform validation and verification at the same time. The use of the Petri net 
language allows the simulation of a Petri net model without conversion to other 
models [26] and without the difficulty of using traditional languages. Also, we have 
successfully developed a Petri net interpreter by using a rule-based expert system 
language, CLIPS. This eased the difficulty of utilizing sophisticated data structures 
and their complicated implementation. The main advantage of using the CLIPS 
language is that CLIPS rules can be activated in parallel, so we can simulate Petri nets more 
effectively. Other than the examples presented in this paper, we have tested several 
more complicated Petri nets. We have found that the interpreter executes a Petri net 
program in a way that exactly reflects the execution of the Petri net model of the physical 
system. We believe that the approach chosen to develop an interpreter is an 
alternative method of simulation. 

The statistics we collected include the firing time, the total number of firings, the 
average firing time of each transition, the number of input jobs, finished jobs, and 
unfinished jobs, and the total simulation time. For further research, there are 
several points to investigate. One is to attach time to places so that both transitions and 
places can be active concurrently. Also, we will consider stochastic firing times with 
different statistical distributions for transitions, to make the simulation 
more realistic [13, 17]. Finally, some additional features like priorities or colors may 
be added to the system to make the Petri net more expressive. 
Abstract. Soft computing (SC) techniques such as fuzzy logic (FL), 
neural networks (NN), and genetic algorithms (GA) are complementary. 
Each SC technique has particular computational properties that 
make it suited for particular problems and not for others. Thus, in 
solving complex, real-world problems, we need to incorporate some SC 
techniques into the application systems to increase the systems’ 
“intelligence”. In this paper, we first propose an agent-based framework for 
integrating SC techniques into practical application systems. We then 
discuss the design and implementation of a platform independent soft 
computing support environment based on the framework. We call such 
an environment an agent-based soft computing society. Such a society can 
facilitate the design of truly robust, flexible and adaptive hybrid intelligent 
systems. 



1 Introduction 

Soft computing is a term that describes a collection of techniques capable of dealing 
with imprecise, uncertain or vague information. SC is not a single methodology. 
Rather, it is a consortium of computing methodologies that collectively 
provide a foundation for the conception, design and deployment of intelligent 
systems. The principal members of SC are fuzzy logic (FL), neural networks 
(NN), and genetic algorithms (GA). SC technologies such as FL, NN, and GA 
are complementary rather than competitive. While these SC techniques have 
produced encouraging results in particular tasks, certain complex problems cannot 
be solved by a single SC technique alone. Each SC technique has particular 
computational properties that make it suited for particular problems and not 
for others. For example, in our ongoing project entitled “Financial Investment 
Advisor Using Intelligent Agent Technologies”, the NN was used as a pattern 
watcher for the stock market, the GA was used to predict interest rates, and 
approximate reasoning based on FL was used to evaluate clients’ financial risk 
tolerance. Thus, in solving complex, real-world problems, we need to 
incorporate some SC techniques into the application systems. 

An agent is an encapsulated computer system that is situated in some environment 
and that is capable of flexible, autonomous action in that environment 
in order to meet its design objectives [13]. Recently, N. R. Jennings gave a qualitative 
analysis to provide the intellectual justification of precisely why agent-based 
systems are well suited to engineering complex software systems [14]. Based on 
the analysis, he argued that: (1) Agent-oriented approaches can significantly enhance 
our ability to model, design and build complex, distributed software systems; 
(2) As well as being suitable for designing and building complex systems, 
the agent-oriented approach will succeed as a mainstream software engineering 
paradigm. 

Many agent-based application systems demonstrate 
that agent-based systems are a useful and powerful solution technology. However, 
these developments also show that designing and building agent systems is difficult. 
At present, there are two major technical impediments to the widespread 
adoption of agent technology [15]: (1) the lack of a systematic methodology enabling 
designers to clearly specify and structure their applications as multi-agent 
systems; and (2) the lack of widely available industrial-strength multi-agent system 
toolkits. 

With these observations in mind, we propose a general-purpose agent-based 
soft computing society. Such a society can be applied to complex, real-world 
problems that can be modeled as multi-agent systems and that, at the same time, 
require different SC techniques to solve. With the support of the 
soft computing agent society, multi-agent system developers need only 
build the domain-specific parts and construct the ontologies used in the specific 
application field, rather than re-inventing the wheel as often happens at the 
moment. In this paper, we will discuss the design and implementation of such a 
soft computing agent society as well as other relevant issues. 

This research was initially motivated by our ongoing financial investment 
advisor project. In this project, we adopted a multi-agent system architecture. A 
multi-agent system approach is natural for a financial investment advisor because 
of the multiplicity of information sources and the different expertise that must be 
brought to bear to produce a good recommendation (such as a stock buy or sell 
decision). Meanwhile, there are many successful applications of SC technologies 
in the financial sector [3][4][5]. With these observations in mind, we integrated some 
SC technologies into our financial investment advice multi-agent system. We 
have discussed approaches to incorporating SC technologies into financial 
investment planning multi-agent systems elsewhere [1][2]. The emphasis of this paper is 
to extend the approaches used in the project and to provide a universal 
framework for incorporating different soft computing technologies into multi-agent 
systems. 

The remainder of the paper is structured as follows. Section 2 presents the framework 
of the agent-based soft computing society. Section 3 gives the design and 
implementation details, which include the technologies used to develop the SC agent 
society, the behaviors of the different agents in the society, the modeling and 
implementation of these different kinds of agents, and an example of the society. 
Section 4 is a brief evaluation of the SC agent society. Finally, Section 5 gives 
concluding remarks. 
Fig. 1. Framework of Agent-Based SC Society 



2 Framework of Agent-Based SC Society 

Each SC technique has particular strengths and weaknesses and 
cannot be applied universally to every problem. This has encouraged the 
hybridization of these SC techniques. In recent years, an increasing number of 
researchers have been working in the field of hybrid systems in an attempt to 
find new ways to integrate two or more technologies to tackle complex real-world 
problems [6][7]. Some of this research involves multi-agent systems. 
Typical hybrid multi-agent systems include the MIX multi-agent platform [9], the 
IMAHDA architecture [10], and the PREDICTOR system ([6], Chapter 9). 

By analyzing these hybrid multi-agent systems, we find that they 
integrate SC technologies into multi-agent systems by 
embedding the SC technologies in each individual software agent, without using any 
middle agents. Such approaches have the following limitations: (1) It is impractical 
to embed many SC technologies within a single agent, since the agent would 
be overloaded. In many applications, the agents in multi-agent systems should 
be kept simple for ease of maintenance, initialization, and customization; (2) It 
is not flexible to add more SC technologies to, or delete unwanted ones from, 
the multi-agent system. For example, one software agent may be equipped with 
fuzzy logic, another with a neural network, and so on. In such a design, one agent can 
only have one SC capability. If we want the agent to possess two or more SC 
capabilities, we must modify the implementation. 

To overcome the drawbacks of currently used approaches, we propose a new approach 
to constructing intelligent hybrid systems. Figure 1 shows the framework. 
A complete system under this framework (we call it an agent-based soft computing 
society) consists of a set of problem solving agents, soft computing agents, and 
a serving agent for these two kinds of agents. Here, problem solving agents are 
agents without SC capability. They are at the front end of a multi-agent system. 
Soft computing agents are at the back end of a multi-agent system. They provide 
problem solving agents with soft computing capabilities. The serving agent is a 
special kind of middle agent. It is similar to the facilitators discussed in [12]. 

Compared with the hybrid multi-agent systems described above, our framework 
has three crucial characteristics that differentiate our work from others: (1) 
Each problem solving agent can easily access all the SC techniques available in 
the system; (2) The presence of the serving agent in our framework allows adaptive 
agent organization; (3) Overall system robustness is also facilitated through 
the use of the serving agent. For example, if a particular SC service provider (SC 
agent) disappears, a requester agent (problem solving agent) can find another 
one with the same or similar capabilities by interrogating the serving agent. 

3 Design and Implementation of the Society 

3.1 Technology Platform 

The most important design criterion of this serving agent is platform independence. 
All system components are developed using Java or other technologies 
which are platform independent. The linkages among the components or various 
agents are provided by the Knowledge Query and Manipulation Language 
(KQML) [12], which encapsulates all the message passing and communication 
capabilities needed within our framework. 

The implementation of our framework is supported by the Java Agent 
Template Lite (JATLite). JATLite is a set of lightweight Java packages being 
developed at Stanford University that can be used to build multi-agent systems. 
JATLite provides a set of fully functional templates, written entirely in 
the Java language, that support the construction of software agents that communicate 
using a peer-to-peer protocol. For more information on JATLite, see 
http://java.stanford.edu/java-agent 

3.2 Behaviors of Different Types of Agents 

Our SC agent society has three types of agents (see Figure 1): problem solving 
agents, serving agents, and soft computing agents. The key component of the 
framework is the serving agent. The behavior of each kind of agent is described 
below: 

- Problem Solving Agent: It is application-specific, i.e., it has its own 
knowledge base; it must have some meta-knowledge about when it needs the help 
of soft computing agents (e.g., for pre- or post-processing some data); it can ask 
soft computing agents to accomplish some subtasks. 

- Soft Computing Serving Agent: It works as an agent name server (ANS) 
and matchmaker of the capabilities of SC agents; it keeps track of the names, 
ontologies, and abilities of all registered soft computing agents in the 
system; it can reply to a query from a problem solving agent with the name and 
ontology of an appropriate soft computing agent. 

- Soft Computing Agent: Each soft computing agent can provide a service 
to problem solving agents using one soft computing technology or a combination 
of them; it can send the processed results back to problem solving 
agents; it must advertise its abilities to the serving agent. 

All problem solving agents or SC agents must register and connect to the 
serving agent. 

Each problem solving agent has its own domain-specific knowledge base as 
well as meta-knowledge about when to use soft computing agents. The serving 
agent records the capabilities, ontologies, and names etc. of all the SC agents in 
a multi-agent system. The scenario goes as follows: 

At a certain stage of the problem solving process, the problem solving agent, 
guided by its meta-knowledge, sends a KQML message using the recommend-one 
performative to the serving agent. The serving agent then searches its SC agent 
database and, using the reply performative, replies with the name and ontology 
of an appropriate SC agent that has the requested capability. After that, the problem 
solving agent communicates with the SC agent directly for a specific problem: 
the problem solving agent provides the SC agent with some parameters according 
to the ontology, and the SC agent sends the final results to the problem solving 
agent. 
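
The shape of the messages exchanged in this scenario can be sketched as follows. 
KQML messages are parenthesized lists consisting of a performative followed by 
keyword parameters such as :sender, :receiver and :content. The Python helper 
kqml_message, the agent names and the content expressions below are hypothetical 
and serve only to illustrate the exchange; they are not taken from our implementation. 

```python
def kqml_message(performative, **parameters):
    """Format a KQML message as a parenthesized string.

    Underscores in parameter names become hyphens, so reply_with is written
    out as the KQML parameter :reply-with.
    """
    fields = [f":{name.replace('_', '-')} {value}" for name, value in parameters.items()]
    return "(" + " ".join([performative] + fields) + ")"


# The problem solving agent asks the serving agent for a suitable SC agent.
request = kqml_message(
    "recommend-one",
    sender="problem-solving-agent",
    receiver="serving-agent",
    reply_with="q1",
    content="(risk-tolerance-evaluation)",      # hypothetical content expression
)

# The serving agent answers with the name and ontology of a matching SC agent.
answer = kqml_message(
    "reply",
    sender="serving-agent",
    receiver="problem-solving-agent",
    in_reply_to="q1",
    content="(SC_agent_FL risk-ontology)",      # hypothetical content expression
)
```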

Under our framework, the types of problems the problem solving agents can 
solve depend on their domain-specific knowledge. 



3.3 Modeling and Implementation 

The three kinds of agents described in Section 3.2 have different models. 
Figure 2 shows the internal structures of the three kinds of agents. 

As we can see in Figure 2, all the agents have a common part, the KQML Message 
Interpreter (KMI). That is because we use KQML for inter-agent communication. 
For the same reason, we call the three kinds of entities in Figure 1 “agents” [11]. 
The KMI represents the interface between the KQML router and the agents. Once an 
incoming KQML message is detected, it is passed to the KMI. The KMI 
translates incoming KQML messages into a form that agents can understand. The 
implementation of the KMI is based on the JATLite KQMLLayer templates. 

Both problem solving agents and soft computing agents have an ontology 
interpreter. They need to interpret and process the :content part of the KQML message 
when they solve a problem. There is no ontology interpreter in the serving agent 
because it does not care about the :content part. In the implementation, we maintain 
an Ontology Interpreter HashTable. The table allows an agent to locate 
the necessary interpreter for every ontology in the KQML message. It should 
be noted that an interpreter can be located at any Internet site other than the 
place where the agent resides. 
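
The role of the ontology interpreter table can be illustrated with a minimal Python 
sketch, in which a dictionary plays the part of the hash table. The message layout 
(a mapping with "ontology" and "content" entries), the ontology name 
"interest-rate" and the function names are all assumptions made for this example. 

```python
def interpret_content(message, interpreters):
    """Look up the interpreter for the message's ontology and apply it to the content."""
    ontology = message["ontology"]
    if ontology not in interpreters:
        raise LookupError(f"no interpreter registered for ontology {ontology!r}")
    return interpreters[ontology](message["content"])


def parse_interest_rate(content):
    """Hypothetical interpreter for content of the form "(rate 5.25)"."""
    keyword, value = content.strip("()").split()
    return {keyword: float(value)}


# The table maps each ontology name to the interpreter that understands it.
interpreters = {"interest-rate": parse_interest_rate}

result = interpret_content(
    {"ontology": "interest-rate", "content": "(rate 5.25)"}, interpreters
)
```

Because the table stores only a reference to each interpreter, an entry could equally 
well point to a stub that forwards the content to an interpreter hosted elsewhere. 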

The domain knowledge in problem solving agents is usually not sufficient 
to solve a problem. They need the help of soft computing agents. The meta-knowledge 
in problem solving agents tells them when to ask for the help of soft 
computing agents. 
Fig. 2. Agent Models in the Society 



The SC agent maintenance module in the serving agent has three functions: 
add an entry that contains the SC agent’s name, ability, and ontology to the 
database; delete an entry from the database; and search the database to find 
SC agents with a specific ability. 
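
These three functions can be sketched in Python as follows. The sketch is a 
simplification for illustration, not our Java implementation: the class name 
SCAgentDatabase, its method names, and the ability and ontology strings are 
hypothetical, and abilities are matched by exact string comparison. 

```python
class SCAgentDatabase:
    """Registry of SC agents kept by the serving agent."""

    def __init__(self):
        self._entries = {}   # agent name -> (ability, ontology)

    def add(self, name, ability, ontology):
        """Register an SC agent that has advertised its ability."""
        self._entries[name] = (ability, ontology)

    def delete(self, name):
        """Remove an SC agent, for example after it has left the society."""
        self._entries.pop(name, None)

    def find(self, ability):
        """Return (name, ontology) for every agent with the given ability."""
        return [(name, ontology)
                for name, (agent_ability, ontology) in self._entries.items()
                if agent_ability == ability]

    def recommend_one(self, ability):
        """Return one matching agent, or None if no agent has the ability."""
        matches = self.find(ability)
        return matches[0] if matches else None


database = SCAgentDatabase()
database.add("SC_agent_FL", "risk-tolerance-evaluation", "risk-ontology")
database.add("SC_agent_FL_backup", "risk-tolerance-evaluation", "risk-ontology")
database.add("SC_agent_GA", "interest-rate-prediction", "rate-ontology")

first = database.recommend_one("risk-tolerance-evaluation")
database.delete("SC_agent_FL")            # the provider disappears
second = database.recommend_one("risk-tolerance-evaluation")
```

After the first provider is deleted, the same query returns the backup agent, which 
is the robustness property attributed to the serving agent in Section 2. 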

For the soft computing algorithms in soft computing agents, if the algorithm is 
under our control, its agent is built using KQML as the communication language. If 
not, we use Java native methods to connect the legacy system to our agent. 

3.4 Example 

We present an example of how the agent-based soft computing society is used 
in the determination of a user’s investment policy in our ongoing project. 

The problem solving agent receives the goal “Determining Investment Policy 
(aggressive or conservative)” in messages coming from other problem solving agents 
or from a user interface directly. To make such a decision, the problem solving 
agent needs information about the user’s risk tolerance (RT) ability, the 
falling or rising of interest rates (F1), the state of the stock market (F2), the 
unemployment rate (F3), etc. The problem solving agent has rules in its domain 
knowledge base such as 

If RT is H and F1 is B1 and . . . then IP is C 

where C is a fuzzy subset indicating the aggression or conservation of the investment 
policy. H and B1 are also fuzzy subsets. The problem solving agent also 
has meta-knowledge such as using SC agents to evaluate the user’s RT and using 
SC agents to predict F1, etc. 

Fig. 3. Practical Architecture for Determining Investment Policy 
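
One common way to evaluate the antecedent of such a fuzzy rule is to take the 
minimum of the membership degrees of its conditions. The Python sketch below 
illustrates this for a rule with two conditions. The choice of the min operator, the 
triangular membership functions and all numeric values are assumptions made for 
illustration; they do not describe the rule base of our system. 

```python
def triangular(a, b, c):
    """Return a triangular membership function with feet at a and c and peak at b."""
    def membership(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return membership


def firing_strength(conditions, values):
    """Degree to which all conditions of a rule hold (min used for 'and')."""
    return min(membership(values[name]) for name, membership in conditions.items())


# Hypothetical fuzzy subsets: H for "high risk tolerance", B1 for "rates falling".
H = triangular(0.5, 1.0, 1.5)
B1 = triangular(-2.0, -1.0, 0.0)

# Crisp values as they might be returned by the SC agents (invented numbers).
strength = firing_strength({"RT": H, "F1": B1}, {"RT": 0.75, "F1": -0.5})
```

The resulting strength indicates how strongly the consequent "IP is C" is 
supported by the given inputs. 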

Thus, the problem solving agent sends KQML messages using the recommend-one 
performative to the serving agent. 

The serving agent then searches its SC agent database and, using the reply 
performative, replies with the name and ontology of an appropriate SC agent that 
has the requested capability. In our system, there are a risk tolerance ability 
evaluation agent based on fuzzy logic and an interest rate prediction agent based on 
genetic algorithms. 

The problem solving agent then communicates with SC_agent_FL (for risk 
tolerance evaluation) and SC_agent_GA (for interest rate prediction) directly. 
They interpret and process the parameters or results (the :content part of the 
KQML messages) by using the ontology interpreters. 

After the problem solving agent obtains the results for RT, F1, etc. from the 
corresponding SC agents, it can infer the conclusion about the user’s investment 
policy according to its domain knowledge. The practical architecture we adopted 
is shown in Figure 3 (under the support of JATLite). 



4 Evaluation of the SC Agent Society 



In our framework, the implementations are platform independent. The serving 
agent is general purpose and can thus be used in any application. To tailor the 
SC agent society for other specific applications, we need to develop the 
domain-specific knowledge bases as well as the meta-knowledge bases of the problem 
solving agents. In the meantime, we also need to construct the ontologies used 
in the specific applications. It is easy to wrap legacy soft computing programs 
and convert them to “agents” by using Java native methods and JATLite templates. 
Developers of intelligent hybrid (multi-agent) systems can use our framework 
for reference. They can easily follow the ideas and construct their own application 
systems with flexibility. 
5 Concluding Remarks 

We presented a flexible framework, the agent-based soft computing society, for 
constructing multi-agent application systems that need to incorporate different 
kinds of soft computing technologies. Our framework has three crucial 
characteristics that differentiate our work from others: (1) Our approach lets 
every problem solving agent easily access all the SC technologies available in the 
system; (2) The presence of the serving agent in our framework allows adaptive 
agent organization, that is, our framework has the ability to add and delete SC 
agents dynamically as needed; (3) Overall system robustness is also facilitated 
through the use of the serving agent. 

Such an SC agent society facilitates the design of robust, flexible and adaptive 
hybrid intelligent systems: we can build intelligent hybrid multi-agent systems 
based on the society, rather than from scratch. 

To facilitate the construction of multi-agent application systems in finance, 
we are currently constructing the ontologies used in finance. 